Principal component analysis (PCA) is a classical method for
dimensionality reduction based on extracting the dominant eigenvectors
of the sample covariance matrix. However, PCA is well known to
behave poorly in the "large p, small n" setting, in which the problem
dimension p is comparable to or larger than the sample size n. This
paper studies PCA in this high-dimensional regime, but under the
additional assumption that the maximal eigenvector is sparse, say,
with at most k nonzero components. We consider a spiked covariance
model in which a base matrix is perturbed by adding a k-sparse
maximal eigenvector, and we analyze two computationally tractable
methods for recovering the support set of this maximal eigenvector,
as follows: (a) a simple diagonal thresholding method, which
transitions from success to failure as a function of the rescaled
sample size $\theta_{\mathrm{dia}}(n,p,k) = n/[k^2 \log(p-k)]$; and (b) a more
sophisticated semidefinite programming (SDP) relaxation, which succeeds
once the rescaled sample size $\theta_{\mathrm{sdp}}(n,p,k) = n/[k \log(p-k)]$ is
larger than a critical threshold. In addition, we prove that no method,
including the best method which has exponential-time complexity, can
succeed in recovering the support if the order parameter
$\theta_{\mathrm{sdp}}(n,p,k)$ is below a threshold. Our results thus highlight an
interesting trade-off between computational and statistical efficiency
in high-dimensional inference.

1. Introduction. Principal component analysis (PCA) is a classical method 
[1, 22] for reducing the dimension of data, say, from some high-dimensional 
subset of $\mathbb{R}^p$ down to some subset of $\mathbb{R}^d$, with $d \ll p$. Principal component
analysis operates by projecting the data onto the d directions of maximal 



Received March 2008; revised August 2008. 

Supported in part by NSF Grants CAREER-CCF-05-45862 and DMS-06-05165, and 
a Sloan Foundation Fellowship. 

AMS 2000 subject classifications. Primary 62H25; secondary 62F12. 

Key words and phrases. Principal component analysis, spectral analysis, spiked covari- 
ance ensembles, sparsity, high-dimensional statistics, convex relaxation, semidefinite pro- 
gramming, Wishart ensembles, random matrices. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2009, Vol. 37, No. 5B, 2877-2921. This reprint differs from the original in 
pagination and typographic detail. 






A. A. AMINI AND M. J. WAINWRIGHT 



variance, as captured by eigenvectors of the $p \times p$ population covariance
matrix $\Sigma$. Of course, in practice, one does not have access to the
population covariance, but instead must rely on a "noisy" version of the form

(1) $\hat{\Sigma} = \Sigma + \Delta$,

where $\Delta = \Delta_n$ denotes a random noise matrix, typically arising from
having only a finite number n of samples. A natural question in assessing the
performance of PCA is under what conditions the sample eigenvectors (i.e.,
based on $\hat{\Sigma}$) are consistent estimators of their population analogues.
In the classical theory of PCA, the model dimension p is viewed as fixed, and
asymptotic statements are established as the number of observations n tends
to infinity. With this scaling, the influence of the noise matrix $\Delta$ dies
off, so that sample eigenvectors and eigenvalues are consistent estimators of
their population
analogues [1]. However, such "fixed p, large n" scaling may be inappropri- 
ate for many contemporary applications in science and engineering (e.g., 
financial time series, astronomical imaging, sensor networks), in which the 
model dimension p is comparable to or even larger than the number of
observations n. This type of high-dimensional scaling causes dramatic breakdowns
in standard PCA and related eigenvector methods, as shown by classical 
and ongoing work in random matrix theory [13, 20, 21]. 

Without further restrictions, there is little hope of performing high-dimensional 
inference with very limited data. However, many data sets exhibit additional 
structure, which can partially mitigate the curse of dimensionality. One nat- 
ural structural assumption is that of sparsity, and various types of sparse 
models have been studied in past statistical work. There is a substantial 
and on-going line of work on subset selection and sparse regression models 
(e.g., [6, 11, 28, 35, 36]), focusing in particular on the behavior of various 
$\ell_1$-based relaxation methods. Other work has tackled the problem of
estimating sparse covariance matrices in the high-dimensional setting, using
thresholding methods [3, 12] as well as $\ell_1$-regularization methods [8, 39].

A related problem — and the primary focus of this paper — is recovering 
sparse eigenvectors from high-dimensional data. While related to sparse co- 
variance estimation, the sparse eigenvector problem presents a different set 
of challenges; indeed, a covariance matrix may have a sparse eigenvector 
with neither it nor its inverse being a sparse matrix. Various researchers 
have proposed methods for extracting sparse eigenvectors, a problem often 
referred to as sparse principal component analysis (SPCA). Some of these 
methods are based on greedy or nonconvex optimization procedures (e.g., 
[23, 29, 40]), whereas others are based on various types of $\ell_1$-regularization
[9, 41]. Zou, Hastie and Tibshirani [41] develop a method based on
transforming the PCA problem to a regression problem and then applying the
Lasso ($\ell_1$-regularization). Johnstone and Lu [21] proposed a two-step
method, using an initial pre-processing step to select relevant variables
followed by ordinary PCA in the reduced space. Under a particular $\ell_q$-ball
sparsity model, they proved $\ell_2$-consistency of their procedure as long as
p/n converges to
a constant. In recent work, d'Aspremont et al. [9] have formulated a di- 
rect semidefinite programming (SDP) relaxation of the sparse eigenvector 
problem, and developed fast algorithms for solving it, but have not pro- 
vided high-dimensional consistency results. The elegant work of Paul and 
Johnstone [30, 32], brought to our attention after initial submission, studies 
estimation of eigenvectors satisfying weak $\ell_q$-ball sparsity assumptions for
$q \in (0,2)$. We discuss connections to this work at more length below.

In this paper, we study the model selection problem for sparse
eigenvectors. More precisely, we consider a spiked covariance model [20], in
which the maximal eigenvector $z^*$ of the population covariance
$\Sigma_p \in \mathbb{R}^{p \times p}$ is k-sparse, meaning that it has nonzero
entries on a subset $S(z^*)$ with cardinality k, and our goal is to recover
this support set exactly. In order to do so, we have access to a matrix
$\hat{\Sigma}$, representing a noisy version of the population covariance, as
in (1). Although our theory is somewhat more generally applicable, the most
natural instantiation of $\hat{\Sigma}$ is as a sample covariance matrix based
on n i.i.d. samples drawn from the population. We analyze this setup in the
high-dimensional regime, in which all three parameters (the number of
observations n, the ambient dimension p and the sparsity index k) are allowed
to tend to infinity simultaneously. Our primary interest is in the following
question: using a given inference procedure, under what conditions on the
scaling of the triplet (n,p,k) is it possible, or conversely impossible, to
recover the support set of the maximal eigenvector $z^*$ with probability one?

We provide a detailed analysis of two procedures for recovering sparse 
eigenvectors, as follows: (a) a simple diagonal thresholding method, used 
as a pre-processing step by Johnstone and Lu [21], and (b) a semidefi- 
nite programming (SDP) relaxation for sparse PCA, recently developed by 
d'Aspremont et al. [9]. Under the k-sparsity assumption on the maximal
eigenvector, we prove that the success or failure probabilities of these two 
methods have qualitatively different scaling in terms of the triplet (n,p,k). 
For the diagonal thresholding method, we prove that its success or failure is 
governed by the rescaled sample size 

(2) $\theta_{\mathrm{dia}}(n,p,k) := \frac{n}{k^2 \log(p-k)}$,



meaning that it succeeds with probability one for scalings of the triplet
(n,p,k) such that $\theta_{\mathrm{dia}}$ is above some critical value and,
conversely, fails with probability one when this ratio falls below some
critical value. We then establish performance guarantees for the SDP
relaxation [9]. In particular, for the same class of models, we show that it
always has a unique rank-one solution that specifies the correct signed
support once $\theta_{\mathrm{dia}}(n,p,k)$ is sufficiently large; moreover, for
sufficiently large values of the rescaled sample size

(3) $\theta_{\mathrm{sdp}}(n,p,k) := \frac{n}{k \log(p-k)}$,

if there exists a rank-one solution, then it specifies the correct signed support. 
The proof of this result is based on random matrix theory, concentration of 
measure and Gaussian comparison inequalities. Our final contribution is to 
use information-theoretic arguments to show that no method can succeed in 
recovering the signed support for the spiked identity covariance model if the 
order parameter $\theta_{\mathrm{sdp}}(n,p,k)$ lies below some critical value. One consequence
is that the given scaling (3) for the SDP relaxation is sharp, meaning the SDP
relaxation also fails once $\theta_{\mathrm{sdp}}$ drops below a critical threshold. Moreover, it
shows that under the rank-one condition, the SDP is in fact statistically 
optimal, that is, it requires only the necessary number of samples (up to a 
constant factor) to succeed. 

The results reported here are complementary to those of Paul and John- 
stone [30, 32], who propose and analyze the augmented SPCA algorithm for 
estimating eigenvectors. In comparison to the models analyzed here, their 
analysis applies to spiked models using the identity base covariance, but it
allows for m > 1 eigenvectors in the spiking. In addition, they consider the
class of weak $\ell_q$-ball sparsity models, as opposed to the hard
$\ell_0$-sparsity model considered here. Another difference is that their
results provide guarantees in terms of the $\ell_2$-norm between the
eigenvector and its estimate, whereas our results guarantee exact support
recovery. We note that an estimate can be close in $\ell_2$-norm while having
a very different support set. Consequently,
the results given here, which provide conditions for exact support recovery, 
provide complementary insight. 

Our results highlight some interesting trade-offs between computational 
and statistical costs in high-dimensional inference. On one hand, the
statistical efficiency of SDP relaxation is substantially greater than the
diagonal thresholding method, requiring fewer observations by a factor of
order $k$ to succeed. However, the computational complexity of SDP is also
larger by roughly a factor $O(p^3)$. An implementation due to d'Aspremont et
al. [9] has complexity $O(np + p^4 \log p)$ as opposed to the
$O(np + p \log p)$ complexity of the diagonal thresholding method. Moreover,
our information-theoretic analysis shows that the best possible method,
namely one based on an exhaustive search over all $\binom{p}{k}$ subsets, with
exponential complexity, does not have substantially greater statistical
efficiency than the SDP relaxation.
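
To make the gap in sample requirements concrete, the following minimal sketch evaluates the two rescaled sample sizes (2) and (3) for a single triplet. The particular values of n, p and k are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math


def theta_dia(n: int, p: int, k: int) -> float:
    """Rescaled sample size n / [k^2 log(p - k)] for diagonal thresholding."""
    return n / (k ** 2 * math.log(p - k))


def theta_sdp(n: int, p: int, k: int) -> float:
    """Rescaled sample size n / [k log(p - k)] for the SDP relaxation."""
    return n / (k * math.log(p - k))


# Hypothetical triplet, chosen only for illustration.
n, p, k = 2000, 1000, 10

# The two order parameters differ by exactly a factor of k, which is the
# source of the difference in sample requirements between the two methods.
ratio = theta_sdp(n, p, k) / theta_dia(n, p, k)
```

For a fixed n, the SDP order parameter is k times larger than the diagonal thresholding one, so a fixed threshold on each quantity translates into a factor of k difference in the required number of samples.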

The remainder of this paper is organized as follows. In Section 2, we 
provide precise statements of our main results, discuss some of their impli- 
cations and provide simulation results to illustrate the sharpness of their 
predictions. Sections 3, 4 and 5 are devoted to proofs of these results, with
some of the more technical aspects deferred to appendices. We conclude in 
Section 6. 

1.1. Notation. For the reader's convenience, we state here some notation 
used throughout the paper. For a vector $x \in \mathbb{R}^n$, we use
$\|x\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$ to denote its $\ell_p$-norm. For a
matrix $A \in \mathbb{R}^{m \times n}$, we use $|||A|||_{p,q}$ to denote the
matrix operator norm induced by vector norms $\ell_p$ and $\ell_q$; more
precisely, we have

(4) $|||A|||_{p,q} := \max_{\|x\|_q = 1} \|Ax\|_p$.

A few cases of particular interest in this paper are (a) the spectral norm
given by

$|||A|||_{2,2} := \max_{i=1,\dots,m} \{\sigma_i(A)\}$,

where $\{\sigma_i(A)\}$ are the singular values of A, and (b) the
$\ell_\infty$-operator norm, given by

$|||A|||_{\infty,\infty} := \max_{i=1,\dots,m} \sum_{j=1}^n |A_{ij}|$.

Given two square matrices $X, Y \in \mathbb{R}^{n \times n}$, we define the
matrix inner product $\langle\langle X, Y \rangle\rangle := \mathrm{tr}(XY^T) = \sum_{i,j} X_{ij} Y_{ij}$.
Note that this inner product induces the Hilbert-Schmidt norm
$|||X|||_{HS} = \sqrt{\langle\langle X, X \rangle\rangle}$.
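
As a quick illustration of this notation, the sketch below evaluates the three norms and the matrix inner product on a small matrix using NumPy. The matrices A and B are arbitrary examples introduced here for illustration only.

```python
import numpy as np

# Arbitrary example matrices (assumptions for illustration only).
A = np.array([[1.0, -2.0], [3.0, 4.0]])
B = np.array([[0.5, 1.0], [-1.0, 2.0]])

# Spectral norm |||A|||_{2,2}: the largest singular value of A.
spectral = np.linalg.norm(A, 2)

# l_infinity operator norm |||A|||_{inf,inf}: the maximum absolute row sum.
linf_op = np.linalg.norm(A, np.inf)

# Hilbert-Schmidt (Frobenius) norm.
hs = np.linalg.norm(A, "fro")


def inner(X, Y):
    """Matrix inner product <<X, Y>> = tr(X Y^T) = sum_ij X_ij Y_ij."""
    return float(np.trace(X @ Y.T))
```

Here the rows of A have absolute sums 3 and 7, so the $\ell_\infty$-operator norm equals 7, and `inner(A, A)` equals the squared Hilbert-Schmidt norm, 30.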

We use the following standard asymptotic notation: for functions $f$, $g$,
the notation $f(n) = O(g(n))$ means that there exists a fixed constant
$0 < C < +\infty$ such that $f(n) \le C g(n)$; the notation
$f(n) = \Omega(g(n))$ means that $f(n) \ge C g(n)$, and $f(n) = \Theta(g(n))$
means that $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$. Note in particular that
when used without a subscript "p," these
symbols are to be interpreted in a deterministic sense, that is, the constants 
involved are assumed to be nonrandom. 

We use $\lambda(A)$ to denote a generic eigenvalue of a square matrix A, as
well as $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ for the minimal and
the maximal eigenvalues, respectively. Any member of the set of eigenvectors
of A associated with an eigenvalue is denoted as $v(A)$. Thus,
$v_{\max}(\cdot)$, for example, represents the eigenvectors associated with the
maximal eigenvalue (occasionally referred to as "maximal eigenvectors"). We
always assume that eigenvectors are normalized to unit $\ell_2$-norm and have
a nonnegative first component. The
sign convention guarantees uniqueness of the eigenvector associated with an 
eigenvalue with geometric multiplicity one. 

Finally, some probabilistic notation: we say a sequence of events
$\{E_j\}_{j \ge 1}$ happens with asymptotic probability one (w.a.p. one) if
$\lim_{j \to +\infty} P[E_j] = 1$, whereas it holds asymptotically almost
surely (a.a.s.) as $j \to +\infty$ if $P(\liminf_j E_j) = 1$.

2. Main results and consequences. The primary focus of this paper is the
spiked covariance model, in which some base covariance matrix is perturbed
by the addition of a sparse eigenvector $z^* \in \mathbb{R}^p$. In particular,
we study sequences of covariance matrices of the form

(5) $\Sigma_p = \beta z^* z^{*T} + \begin{bmatrix} I_k & 0 \\ 0 & \Gamma_{p-k} \end{bmatrix}$,

where $\Gamma_{p-k} \in S_+^{p-k}$ is a symmetric PSD matrix with
$\lambda_{\max}(\Gamma_{p-k}) \le 1$. Note that we have assumed (without loss of
generality, by re-ordering the indices as necessary) that the nonzero entries
of $z^*$ are indexed by $\{1,\dots,k\}$, so that (5) is the form of the
covariance after any re-ordering. We also assume that the nonzero part of
$z^*$ has entries $z_i^* \in \frac{1}{\sqrt{k}}\{-1,+1\}$, so that
$\|z^*\|_2 = 1$. The spiked covariance model (5) was first proposed by
Johnstone [20], who focused on the spiked identity covariance matrix [i.e.,
model (5) with $\Gamma_{p-k} = I_{p-k}$]. Johnstone and Lu [21] established
that the sample eigenvectors for the spiked identity model, based on a set of
n i.i.d. samples with distribution $N(0, \Sigma_p)$ from the spiked identity
ensemble, are inconsistent as estimators of $z^*$ whenever $p/n \to c > 0$.
These asymptotic results were refined by later work [2, 31].

In this paper, we study a slightly more general family of spiked covariance
models, in which the matrix $\Gamma_{p-k}$ is required to satisfy the following
conditions: 

(6a) A1. $|||\sqrt{\Gamma_{p-k}}|||_{\infty,\infty} = O(1)$ and

(6b) A2. $\lambda_{\max}(\Gamma_{p-k}) \le \min\{1, \lambda_{\min}(\Gamma_{p-k}) + \frac{\beta}{8}\}$.

Here $\sqrt{\Gamma_{p-k}}$ denotes the symmetric square root. These conditions
are trivially satisfied by the identity matrix $I_{p-k}$, but also can hold for
more general nondiagonal matrices. Thus, under the model (5), the population
covariance matrix $\Sigma$ itself need not be sparse, since (at least
generically) it has $k^2 + (p-k)^2 = \Theta(p^2)$ nonzero entries. Assumption
(A2) on the eigenspectrum of the matrix $\Gamma_{p-k}$ ensures that as long as
$\beta > 0$, then the vector $z^*$ is the unique maximal eigenvector of
$\Sigma$, with associated eigenvalue $(1 + \beta)$. Since the remaining
eigenvalues are bounded above by 1, the parameter $\beta > 0$ represents a
signal-to-noise ratio, characterizing the separation between the maximal
eigenvalue and the remainder of the eigenspectrum. Assumption (A1) is related
to the fact that recovering the correct signed support means that the
estimate $\hat{z}$ must satisfy $\|\hat{z} - z^*\|_\infty \le 1/\sqrt{k}$. As
will be clarified by our analysis (see Section 4.4), controlling this
$\ell_\infty$-norm requires bounds on terms of the form
$\|\sqrt{\Gamma_{p-k}}\, u\|_\infty$, which requires control of the
$\ell_\infty$-operator norm $|||\sqrt{\Gamma_{p-k}}|||_{\infty,\infty}$.
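
The structure of the spiked model can be checked numerically. The sketch below builds a small instance of (5), under the assumption that the base matrix is block diagonal with an identity block on the support (this is consistent with the diagonal values $1 + \beta/k$ on the support quoted in Section 2.1), and confirms that $z^*$ is the maximal eigenvector with eigenvalue $1 + \beta$. The function name and the specific sizes are hypothetical.

```python
import numpy as np


def spiked_covariance(z_star, beta, gamma):
    """Sigma = beta z z^T + blockdiag(I_k, Gamma_{p-k}).

    Assumes the support of z_star is the first k coordinates, where
    k = p - gamma.shape[0].
    """
    p = z_star.size
    k = p - gamma.shape[0]
    base = np.eye(p)
    base[k:, k:] = gamma
    return beta * np.outer(z_star, z_star) + base


# Illustrative sizes (assumptions): p = 8, k = 3, beta = 3.
p, k, beta = 8, 3, 3.0
z_star = np.zeros(p)
z_star[:k] = np.array([1.0, -1.0, 1.0]) / np.sqrt(k)  # entries +-1/sqrt(k)
gamma = np.eye(p - k)                                 # spiked identity case

sigma = spiked_covariance(z_star, beta, gamma)
eigvals, eigvecs = np.linalg.eigh(sigma)              # ascending eigenvalues
lam_max, v_max = eigvals[-1], eigvecs[:, -1]
```

In this instance the top eigenvalue is $1 + \beta = 4$, all other eigenvalues equal 1, and `v_max` agrees with `z_star` up to sign.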

In this paper, we study the model selection problem for eigenvectors: that
is, we assume that the maximal eigenvector $z^*$ is k-sparse, meaning that
it has exactly k nonzero entries, and our goal is to recover this support,
along with the sign of $z^*$ on its support. We let
$S(z^*) = \{i \mid z_i^* \ne 0\}$ denote the support set of the maximal
eigenvector; recall that $S(z^*) = \{1,\dots,k\}$ by our assumed ordering of
the indices. Moreover, we define the function
$S_\pm : \mathbb{R}^p \to \{-1,0,+1\}^p$ by

(7) $[S_\pm(u)]_i := \begin{cases} \mathrm{sign}(u_i), & \text{if } u_i \ne 0, \\ 0, & \text{otherwise,} \end{cases}$

so that $S_\pm(z^*)$ encodes the signed support of the maximal eigenvector.
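
The signed-support map (7) and the 0-1 comparison of two signed supports are simple to express in code. The following is a minimal sketch; the function names and the optional tolerance `tol` (useful when the input is a numerically computed vector) are assumptions introduced here, not part of the paper.

```python
import numpy as np


def signed_support(u, tol=0.0):
    """Return the vector in {-1, 0, +1}^p defined by (7).

    Entries with |u_i| <= tol are treated as zero; tol = 0 reproduces the
    exact definition.
    """
    u = np.asarray(u, dtype=float)
    s = np.sign(u).astype(int)
    s[np.abs(u) <= tol] = 0
    return s


def zero_one_loss(s_hat, s_true):
    """0-1 loss: 1 if the two signed supports differ anywhere, else 0."""
    return int(not np.array_equal(s_hat, s_true))


example = signed_support([0.5, -0.5, 0.0, 0.5])   # array([ 1, -1,  0,  1])
```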

Given some estimate $\hat{S}_\pm$ of the true signed support $S_\pm(z^*)$, we
assess it based on the 0-1 loss $\mathbb{I}[\hat{S}_\pm \ne S_\pm(z^*)]$, so
that the associated risk is simply the probability of incorrect decision
$P[\hat{S}_\pm \ne S_\pm(z^*)]$. Our goal is to specify conditions on the
scaling of the triplet (n,p,k) such that this error probability vanishes, or
conversely, fails to vanish asymptotically. We consider methods that operate
based on a set of n samples $x^1,\dots,x^n$, drawn i.i.d. with distribution
$N(0, \Sigma_p)$. Under the spiked covariance model (5), each sample can be
written as

(8) $x^i = \sqrt{\beta}\, v^i z^* + \sqrt{\Gamma}\, g^i$,

where $\sqrt{\Gamma}$ is the symmetric matrix square root. Here
$v^i \sim N(0,1)$ is standard Gaussian, and $g^i \sim N(0, I_{p \times p})$ is
a standard Gaussian p-vector, independent of $v^i$, so that
$\sqrt{\Gamma}\, g^i \sim N(0, \Gamma)$. The data $\{x^i\}_{i=1}^n$ defines the
sample covariance matrix

(9) $\hat{\Sigma} := \frac{1}{n} \sum_{i=1}^n x^i (x^i)^T$,

which follows a p-variate Wishart distribution [1]. In this paper, we analyze
the high-dimensional scaling of two methods for recovering the signed sup- 
port of the maximal eigenvector. It will be assumed throughout that the size 
k of the support of z* is available to the methods a priori, that is, we do not 
make any attempt at estimating k. 
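
The sampling representation (8) and the sample covariance (9) translate directly into a short simulation. The sketch below covers only the spiked identity case (base matrix equal to the identity), and the function names, seed and problem sizes are assumptions chosen for illustration.

```python
import numpy as np


def sample_spiked_identity(n, z_star, beta, rng):
    """Draw n samples x^i = sqrt(beta) v^i z* + g^i (identity base matrix).

    Returns an (n, p) array whose rows are the samples.
    """
    p = z_star.size
    v = rng.standard_normal(n)            # v^i ~ N(0, 1)
    g = rng.standard_normal((n, p))       # g^i ~ N(0, I_p), independent of v
    return np.sqrt(beta) * np.outer(v, z_star) + g


def sample_covariance(X):
    """Sample covariance (1/n) sum_i x^i (x^i)^T of zero-mean samples."""
    n = X.shape[0]
    return X.T @ X / n


# Illustrative problem size (assumption).
rng = np.random.default_rng(0)
p, k, beta, n = 50, 5, 3.0, 2000
z_star = np.zeros(p)
z_star[:k] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

X = sample_spiked_identity(n, z_star, beta, rng)
sigma_hat = sample_covariance(X)
```

With n this large relative to p, the diagonal of `sigma_hat` is close to $1 + \beta/k$ on the support and close to 1 elsewhere, which is the observation that motivates diagonal thresholding.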



2.1. Diagonal thresholding method. Under the spiked covariance model 
(5), note that the diagonal elements of the population covariance satisfy 
$\Sigma_{\ell\ell} = 1 + \beta/k$ for all $\ell \in S$, and
$\Sigma_{\ell\ell} \le 1$ for all $\ell \notin S$. (This latter bound follows
since for all $\ell \notin S$, we have
$\Sigma_{\ell\ell} \le |||\Gamma_{p-k}|||_{2,2} \le 1$.) This observation
motivates a natural approach to recovering information about the support
set S, previously used as a pre-processing step by Johnstone and Lu [21].
Let $D_\ell$, $\ell = 1,\dots,p$, be the diagonal elements of the sample
covariance matrix, namely,

$D_\ell = \frac{1}{n} \sum_{i=1}^n (x_\ell^i)^2 = \hat{\Sigma}_{\ell\ell}$.

Form the associated order statistics

$D_{(1)} \le D_{(2)} \le \cdots \le D_{(p-1)} \le D_{(p)}$,

and output the random subset $\hat{S}(D)$ of cardinality k specified by the
indices of the largest k elements $\{D_{(p-k+1)},\dots,D_{(p)}\}$. The chief
appeal of this method is its low computational complexity. Apart from the
order $O(np)$ of computing the diagonal elements of $\hat{\Sigma}$, it requires
only performing a sorting operation, with complexity $O(p \log p)$.
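
The procedure just described amounts to a few lines of code. The following is a minimal sketch, assuming the samples are stored as the rows of an array X and that indices are 0-based; the function name is hypothetical.

```python
import numpy as np


def diagonal_thresholding(X, k):
    """Estimate the support as the indices of the k largest diagonal entries.

    X is an (n, p) array of samples; only the diagonal of the sample
    covariance is formed, at cost O(np), followed by an O(p log p) sort.
    """
    D = np.mean(X ** 2, axis=0)       # D_l = (1/n) sum_i (x^i_l)^2
    order = np.argsort(D)             # ascending order statistics
    return set(order[-k:].tolist())


# Deterministic toy input: columns 1 and 4 carry inflated variance.
X_toy = np.ones((4, 6))
X_toy[:, [1, 4]] *= 3.0
support_hat = diagonal_thresholding(X_toy, k=2)   # {1, 4}
```

Note that the full $p \times p$ sample covariance is never formed, which is what keeps the cost at $O(np + p \log p)$.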

Note that this method provides only an estimate of the support S(z*), 
as opposed to the signed support S±(z*). One could imagine extending the 
method to extract sign information as well, but our main interest in studying 
this method is to provide a simple benchmark by which to calibrate our 
later results on the performance of the more complex SDP relaxation. In 
particular, the following result provides a precise characterization of the 
statistical behavior of the diagonal thresholding method. 

Proposition 1 (Performance of diagonal thresholding). For
$k = O(p^{1-\delta})$ for any $\delta \in (0,1)$, the probability of successful
recovery using diagonal thresholding undergoes a phase transition as a
function of the rescaled sample size

(10) $\theta_{\mathrm{dia}}(n,p,k) = \frac{n}{k^2 \log(p-k)}$.

More precisely, there exists a constant $\theta_u$ such that if
$n > \theta_u k^2 \log(p-k)$, then

(11) $P[\hat{S}(D) = S(z^*)] \ge 1 - \exp(-\Theta(k^2 \log(p-k))) \to 1$,

so that the method succeeds w.a.p. one, and a constant $\theta_\ell > 0$ such
that if $n \le \theta_\ell k^2 \log(p-k)$, then

(12) $P[\hat{S}(D) = S(z^*)] \le \exp(-\Theta(\log(p-k))) \to 0$,

so that the method fails w.a.p. one.



Remarks. The proof of Proposition 1, provided in Section 3, is based
on large deviations bounds on $\chi^2$-variates. The achievability assertion
(11) uses known upper bounds on the tails of $\chi^2$-variates (e.g., [4, 21]).
The converse result (12) requires an exponentially tight lower bound on the
tails of $\chi^2$-variates, which we derive in Appendix C.



ANALYSIS OF SEMIDEFINITE RELAXATIONS 






To illustrate the prediction of Proposition 1, we provide some simulation
results on the diagonal thresholding method. For all experiments reported
here, we generated n samples $\{x^1,\dots,x^n\}$ in an i.i.d. manner from the
spiked covariance ensemble (5), with $\Gamma = I$ and $\beta = 3$. Figure 1
illustrates the behavior predicted by Proposition 1. Each panel plots the
success probability $P[\hat{S}(D) = S(z^*)]$ versus the rescaled sample size
$\theta_{\mathrm{dia}}(n,p,k) = n/[k^2 \log(p-k)]$. Each panel shows five model
dimensions ($p \in \{100, 200, 300, 600, 1200\}$), with panel (a) showing the
logarithmic sparsity index $k = O(\log p)$ and panel (b) showing the case
$k = O(\sqrt{p})$. Each point on each curve corresponds to the average of 100
independent trials. As predicted by Proposition 1, the curves all coincide,
even though they correspond to very different regimes of (p,k).
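
An experiment of this kind can be sketched in a few lines. The code below is not the code used to produce Figure 1; it is a minimal Monte Carlo sketch for the spiked identity ensemble, with a hypothetical function name, a reduced number of trials, and problem sizes chosen only so that it runs quickly.

```python
import numpy as np


def success_rate(theta, p, k, beta=3.0, trials=50, seed=0):
    """Empirical success probability of diagonal thresholding.

    The sample size is set from the rescaled sample size theta via
    n = theta * k^2 * log(p - k); the support is the first k coordinates.
    """
    rng = np.random.default_rng(seed)
    n = max(1, int(round(theta * k ** 2 * np.log(p - k))))
    z_star = np.zeros(p)
    z_star[:k] = 1.0 / np.sqrt(k)     # signs do not affect the diagonal
    true_support = set(range(k))
    hits = 0
    for _ in range(trials):
        v = rng.standard_normal(n)
        g = rng.standard_normal((n, p))
        X = np.sqrt(beta) * np.outer(v, z_star) + g
        D = np.mean(X ** 2, axis=0)
        support_hat = set(np.argsort(D)[-k:].tolist())
        hits += int(support_hat == true_support)
    return hits / trials


# Two points on the curve for one illustrative pair (p, k).
rate_high = success_rate(theta=15.0, p=100, k=5)
rate_low = success_rate(theta=0.2, p=100, k=5)
```

Sweeping `theta` over a grid for several values of p would trace out curves of the kind shown in Figure 1; the two values above simply sit on either side of the transition.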

2.2. Semidefinite-programming relaxation. We now describe the approach
to sparse PCA developed by d'Aspremont et al. [9]. Let
$S_+^p = \{Z \in \mathbb{R}^{p \times p} \mid Z = Z^T, Z \succeq 0\}$ denote the
cone of symmetric, positive semidefinite (PSD) matrices. Given n i.i.d.
observations from the model $N(0, \Sigma_p)$, let $\hat{\Sigma}$ be the sample
covariance matrix (9), and let $\rho_n > 0$ be a user-defined regularization
parameter. d'Aspremont et al. [9] propose estimating $z^*$ by solving the
optimization problem

(13) $\hat{Z} := \arg\max_{Z \in S_+^p} \Bigl[ \mathrm{tr}(\hat{\Sigma} Z) - \rho_n \sum_{i,j} |Z_{ij}| \Bigr]$ s.t. $\mathrm{tr}(Z) = 1$,



and computing the maximal eigenvector $\hat{z} = v_{\max}(\hat{Z})$. The
optimization problem (13) is a semidefinite program (SDP), a class of convex
conic programs that can be solved exactly in polynomial time. Indeed,
d'Aspremont et al. [9] describe an $O(p^4 \log p)$ algorithm, with an
implementation posted online, that we use for all simulations reported in
this paper.

[Figure 1: two panels, (a) "Diagonal thresholding: k = O(log p)" and
(b) "Diagonal thresholding: k = O(sqrt(p))"; horizontal axis: control
parameter; curves for p = 100, 200, 300, 600, 1200.]

Fig. 1. Plot of the success probability $P[\hat{S}(D) = S(z^*)]$ versus the
rescaled sample size $\theta_{\mathrm{dia}}(n,p,k) = n/[k^2 \log(p-k)]$. The five
curves in each panel correspond to model dimensions
$p \in \{100, 200, 300, 600, 1200\}$, SNR parameter $\beta = 3$ and sparsity
indices $k = O(\log p)$ in panel (a) and $k = O(\sqrt{p})$ in panel (b). As
predicted by Proposition 1, the success probability undergoes a phase
transition, with the curves for different model sizes and different sparsity
indices all lying on top of one another.

To gain some intuition for the SDP relaxation (13), recall the following 
Courant-Fischer variational representation [18] of the maximal eigenvalue 
and eigenvector: 

(14) $v_{\max}(\Sigma) = \arg\max_{\|z\|_2 = 1} z^T \Sigma z$.

A lesser known but equivalent variational representation is in terms of the
semidefinite program (SDP)

(15) $Z^* = \arg\max_{Z \in S_+^p,\ \mathrm{tr}(Z) = 1} \mathrm{tr}(\Sigma Z)$.

For this problem, if the maximal eigenvalue is simple, the optimum is always
achieved at a rank-one matrix $Z^* = z^*(z^*)^T$, where
$z^* = v_{\max}(\Sigma)$ is the maximal eigenvector; otherwise, there exist
optimal solutions of higher rank, but the optimum is always achieved by at
least some rank-one matrix. If we were given a priori information that the
maximal eigenvector were sparse, then it might be natural to solve the same
semidefinite program with the addition of an $\ell_0$ constraint. Given the
intractability of such an $\ell_0$-optimization problem, the SDP program (13)
is a natural relaxation.
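
The equivalence between (14) and (15) can be sanity-checked numerically without an SDP solver. The sketch below, using a randomly generated PSD matrix as a stand-in for $\Sigma$ (an assumption for illustration), verifies that the rank-one matrix built from the maximal eigenvector is feasible and attains the value $\lambda_{\max}(\Sigma)$, and that randomly drawn feasible points never exceed it. This is a check of the variational identity only, not an implementation of the relaxation (13).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
B = rng.standard_normal((p, p))
sigma = B @ B.T / p                      # a generic PSD matrix

eigvals, eigvecs = np.linalg.eigh(sigma) # ascending eigenvalues
lam_max, v_max = eigvals[-1], eigvecs[:, -1]

# Rank-one candidate Z* = v v^T: PSD with unit trace, hence feasible for (15).
Z_star = np.outer(v_max, v_max)
value_at_Z_star = float(np.trace(sigma @ Z_star))


def random_feasible(rng, p):
    """Random PSD matrix with unit trace, i.e. a feasible point of (15)."""
    M = rng.standard_normal((p, p))
    Z = M @ M.T
    return Z / np.trace(Z)


# Objective values at random feasible points; none should exceed lam_max.
values = [float(np.trace(sigma @ random_feasible(rng, p))) for _ in range(1000)]
```

The bound follows from $\mathrm{tr}(\Sigma Z) \le \lambda_{\max}(\Sigma)\, \mathrm{tr}(Z)$ for PSD $Z$, so the objective over the feasible set is maximized at the rank-one point.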

In particular, the following result provides sufficient conditions for the 
SDP relaxation (13) to succeed in recovering the correct signed support of 
the maximal eigenvector. 

Theorem 2 (SDP performance guarantees). Impose conditions (6a)
and (6b) on the sequence of population covariance matrices $\{\Sigma_p\}$ and
suppose moreover that $\rho_n = \beta/(2k)$ and $k = O(\log p)$. Then:

(a) Rank guarantee: there exists a constant
$\theta_{\mathrm{wr}} = \theta_{\mathrm{wr}}(\Gamma, \beta)$ such that for all
sequences (n,p,k) satisfying $\theta_{\mathrm{dia}}(n,p,k) > \theta_{\mathrm{wr}}$,
the semidefinite program (13) has a rank-one solution with high probability.

(b) Critical scaling: there exists a constant
$\theta_{\mathrm{crit}} = \theta_{\mathrm{crit}}(\Gamma, \beta)$ such that if the
sequence (n,p,k) satisfies

(16) $\theta_{\mathrm{sdp}}(n,p,k) := \frac{n}{k \log(p-k)} > \theta_{\mathrm{crit}}$

and if there exists a rank-one solution, then it specifies the correct signed
support with probability converging to one.

Remarks. Part (a) of the theorem shows that rank-one solutions of
the SDP (13) are not uncommon; in particular, they are guaranteed to exist
with high probability at least under the weaker scaling of the diagonal
thresholding method. The main contribution of Theorem 2 is its part (b),
which provides sufficient conditions for signed support recovery using the
SDP, when a rank-one solution exists. The bulk of our technical effort is
devoted to part (b); indeed, the proof of part (a) is straightforward once
all the pieces of the proof of part (b) have been introduced, and so will be
deferred to Appendix G. For technical reasons, our current proof(s) require
the condition $k = O(\log p)$; however, it should be possible to remove this
restriction, and indeed, the empirical results do not appear to require it.

Proposition 1 and Theorem 2 apply to the performance of specific (polynomial- 
time) methods. It is natural then to ask whether there exists any algorithm, 
possibly with super-polynomial complexity, that has greater statistical effi- 
ciency. The following result is information-theoretic in nature, and charac- 
terizes the fundamental limitations of any algorithm regardless of its com- 
putational complexity. 

Theorem 3 (Information-theoretic limitations). Consider the problem
of recovering the eigenvector support in the spiked covariance model (5) with
$\Gamma_{p-k} = I_{p-k}$. For any sequence $(n,p,k) \to +\infty$ such that

(17) $\theta_{\mathrm{sdp}}(n,p,k) := \frac{n}{k \log(p-k)} < \frac{1+\beta}{\beta^2}$,

the probability of error of any method is at least 1/2.

Remarks. Together with Theorem 2, this result establishes the sharp- 
ness of the threshold (16) in characterizing the behavior of SDP relaxation, 
and moreover, it guarantees optimality of the SDP scaling (16), up to con- 
stant factors, for the spiked identity ensemble. 

To illustrate the predictions of Theorems 2 and 3, we applied the SDP
relaxation to the spiked identity covariance ensemble, again generating n
i.i.d. samples. We solved the SDP relaxation using publicly available code
provided by d'Aspremont et al. [9]. Figure 2 shows the corresponding plots
for the SDP relaxation [9]. Here we plot the probability
$P[S_\pm(\hat{z}) = S_\pm(z^*)]$ that the SDP relaxation correctly recovers the
signed support of the unknown eigenvector $z^*$, where the signs are chosen
uniformly in $\{-1,+1\}$ at random. Following Theorem 2, the horizontal axis
plots the rescaled sample size
$\theta_{\mathrm{sdp}}(n,p,k) = n/[k \log(p-k)]$. Each panel shows plots for
three different problem sizes, $p \in \{100, 200, 300\}$, with panel (a)
corresponding to logarithmic sparsity [$k = O(\log p)$], and panel (b) to
linear sparsity ($k = 0.1p$).






[Figure 2: two panels, (a) "SDP relaxation (k = O(log p))" and
(b) "SDP relaxation (k = 0.1 p)"; horizontal axis: control parameter.]

Fig. 2. Performance of the SDP relaxation for the spiked identity ensemble,
plotting the success probability $P[S_\pm(\hat{z}) = S_\pm(z^*)]$ versus the
rescaled sample size $\theta_{\mathrm{sdp}}(n,p,k) = n/[k \log(p-k)]$. The three
curves in each panel correspond to model dimensions
$p \in \{100, 200, 300\}$, SNR parameter $\beta = 3$ and sparsity indices
$k = O(\log p)$ in panel (a) and $k = 0.1p$ in panel (b). As predicted by
Theorem 2, the curves in panel (a) all lie on top of one another, and
transition to success once the order parameter $\theta_{\mathrm{sdp}}$ is
sufficiently large.



Consistent with the prediction of Theorem 2, the success probability rapidly 
approaches one once the rescaled sample size exceeds some critical threshold. 
[Strictly speaking, Theorem 2 only covers the case of logarithmic sparsity 
shown in panel (a), but the linear sparsity curves in panel (b) show the same 
qualitative behavior.] Note that this empirical behavior is consistent with 
our conclusion that the order parameter $\theta_{\mathrm{sdp}}(n,p,k) = n/[k \log(p-k)]$ is a
sharp description of the SDP threshold. 



3. Proof of Proposition 1. We begin by proving the achievability result
(11). We provide a detailed proof for the case $\Gamma_{p-k} = I_{p-k}$ and
discuss necessary modifications for the general case at the end. For
$\ell = 1,\dots,p$, we have

(18) $D_\ell = \frac{1}{n} \sum_{i=1}^n (x_\ell^i)^2 = \frac{1}{n} \sum_{i=1}^n (\sqrt{\beta}\, z_\ell^* v^i + g_\ell^i)^2$.

Since $(\sqrt{\beta}\, z_\ell^* v^i + g_\ell^i) \sim N(0, \beta (z_\ell^*)^2 + 1)$
for each i, the rescaled variate $\frac{n}{\beta (z_\ell^*)^2 + 1} D_\ell$ is
central $\chi_n^2$ with n degrees of freedom. Consequently, we have

$E[D_\ell] = \begin{cases} 1, & \text{for all } \ell \in S^c, \\ 1 + \frac{\beta}{k}, & \text{for all } \ell \in S, \end{cases}$

where we have used the fact that $(z_\ell^*)^2 = 1/k$ for $\ell \in S$, by
assumption.

A sufficient condition for success of the diagonal thresholding decoder is
the existence of a threshold $\tau_k$ such that $D_\ell \ge (1 + \tau_k)$ for
all $\ell \in S$, and $D_\ell < (1 + \tau_k)$ for all $\ell \in S^c$. Using the
union bound and the tail bound (61) on central $\chi^2$, we have

$P\Bigl[\max_{\ell \in S^c} D_\ell \ge (1 + \tau_k)\Bigr] \le (p-k)\, P\Bigl[\frac{\chi_n^2}{n} \ge 1 + \tau_k\Bigr] \le (p-k) \exp\Bigl(-\frac{3n}{16} \tau_k^2\Bigr)$,

so that the probability of false inclusion vanishes as long as
$n > \frac{16}{3} (\tau_k)^{-2} \log(p-k)$.

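On the other hand, using the union bound and the tail bound (60b), we
have

$P\Bigl[\min_{\ell \in S} D_\ell < (1 + \tau_k)\Bigr] \le k\, P\Bigl[\frac{\chi_n^2}{n} - 1 < \frac{1 + \tau_k}{1 + \beta/k} - 1\Bigr] = k\, P\Bigl[\frac{\chi_n^2}{n} - 1 < -\frac{\beta/k - \tau_k}{1 + \beta/k}\Bigr]$.

As long as $\tau_k < \beta/k$, we may choose
$x = \frac{n}{4} \bigl(\frac{\beta/k - \tau_k}{1 + \beta/k}\bigr)^2$ in (60b),
thereby obtaining the upper bound

$P\Bigl[\min_{\ell \in S} D_\ell < (1 + \tau_k)\Bigr] \le k \exp\Bigl(-\frac{n}{4} \Bigl(\frac{\beta/k - \tau_k}{1 + \beta/k}\Bigr)^2\Bigr)$,

so that the probability of false exclusion vanishes as long as
$n > 4 \bigl(\frac{1 + \beta/k}{\beta/k - \tau_k}\bigr)^2 \log k$.

Overall, choosing $\tau_k = \frac{\beta}{2k}$ ensures that the probability of
both types of error vanish asymptotically as long as

$n > \max\Bigl\{ \frac{64}{3\beta^2}\, k^2 \log(p-k),\ \frac{16}{\beta^2}\, (k+\beta)^2 \log k \Bigr\}$.

Since $k = o(p)$, the $\log(p-k)$ term is the dominant requirement. The
modifications required for the case of general $\Gamma_{p-k}$ are
straightforward. Since
$\mathrm{var}\bigl((\sqrt{\Gamma_{p-k}}\, g^i)_\ell\bigr) = (\Gamma_{p-k})_{\ell\ell} \le 1$
for all $\ell \in S^c$ and samples $i = 1,\dots,n$, we need to adjust the
scaling of the $\chi_n^2$ variates. For general $\Gamma_{p-k}$, the variates
$\{D_\ell, \ell \in S^c\}$ need no longer be independent, but our proof used
only the union bound, and so is valid regardless of the dependence structure.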

We now prove the converse claim (12) for the spiked identity ensemble.
At a high level, this portion of the proof consists of the following steps.
For a positive real t, define the events

$A_1(t) := \Bigl\{\max_{\ell \in S^c} D_\ell > 1 + t\Bigr\}$ and $A_2(t) := \Bigl\{\min_{\ell \in S} D_\ell < 1 + t\Bigr\}$.

Noting that the event $A_1(t) \cap A_2(t)$ implies failure of the diagonal
cutoff decoder, it suffices to show the existence of some $t > 0$ such that
$P[A_1(t)] \to 1$ and $P[A_2(t)] \to 1$.

Analysis of event $A_1(t)$. Central to the analysis of event $A_1$ is the
following large-deviations lower bound on $\chi^2$-variates.

Lemma 4. For a central $\chi_n^2$ variable with n degrees of freedom, there
exists a constant $C > 0$ such that

$P\Bigl[\frac{\chi_n^2}{n} > 1 + t\Bigr] \ge \frac{C}{\sqrt{n}} \exp(-nt^2/2)$

for all $t \in (0,1)$.

See Appendix C for the proof.

We exploit this lemma as follows. First, define the integer-valued random variable

$$ Z(t) := \sum_{\ell \in S^c} \mathbb{I}[D_\ell > 1 + t], $$

corresponding to the number of indices $\ell \in S^c$ for which the diagonal entry $D_\ell$ exceeds $1 + t$, and note that $\mathbb{P}[A_1(t)] = \mathbb{P}[Z(t) > 0]$. By a one-sided Chebyshev inequality [15], we have

$$ \mathbb{P}[A_1(t)] = \mathbb{P}[Z(t) > 0] \ge \frac{(\mathbb{E}[Z(t)])^2}{(\mathbb{E}[Z(t)])^2 + \operatorname{var}(Z(t))}. \tag{19} $$

Note that $Z(t)$ is a sum of $(p-k)$ independent Bernoulli indicators, each with the same parameter $q(t) := \mathbb{P}[D_\ell > 1 + t]$. Computing the mean $\mathbb{E}[Z(t)] = (p-k)q(t)$ and variance $\operatorname{var}(Z(t)) = (p-k)q(t)(1 - q(t))$, and then substituting into the Chebyshev bound (19), we obtain

$$ \mathbb{P}[A_1(t)] \ge \frac{(p-k)^2 q^2(t)}{(p-k)^2 q^2(t) + (p-k)q(t)(1-q(t))} \ge \frac{(p-k)q(t)}{(p-k)q(t) + 1} \ge 1 - \frac{1}{(p-k)q(t)}. $$

Consequently, the condition $(p-k)q(t) \to \infty$ implies that $\mathbb{P}[A_1(t)] \to 1$.
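The one-sided Chebyshev step can be checked numerically. The following sketch compares the second-moment lower bound on $\mathbb{P}[Z > 0]$ with the exact value for a binomial count; the helper name and the sample values of $(p-k)$ and $q$ are illustrative assumptions, not quantities from the paper.

```python
def prob_nonzero_lower_bound(m, q):
    """Second-moment (one-sided Chebyshev) lower bound on P[Z > 0], Z ~ Binomial(m, q)."""
    mean = m * q
    var = m * q * (1.0 - q)
    return mean ** 2 / (mean ** 2 + var)

# (m, q) play the roles of (p - k, q(t)); the values are arbitrary.
cases = [(10 ** 3, 1e-3), (10 ** 4, 1e-3), (10 ** 6, 1e-3)]
results = []
for m, q in cases:
    bound = prob_nonzero_lower_bound(m, q)
    exact = 1.0 - (1.0 - q) ** m          # P[Z > 0] for independent indicators
    crude = 1.0 - 1.0 / (m * q)           # the weaker bound 1 - 1/((p-k) q)
    results.append((m * q, crude, bound, exact))
    print(m * q, crude, bound, exact)
```

As $(p-k)q$ grows, all three quantities approach one, which is the mechanism used in the proof.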



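Let us set $t = \sqrt{\delta\log(p-k)/n}$. [Here $\delta \in (0,1)$ is the parameter from the assumption $k = O(p^{1-\delta})$.] From Lemma 4, we have $q(t) \ge \frac{C}{\sqrt{n}}\exp(-nt^2/2)$, so that

$$ (p-k)\,q\Big(\sqrt{\frac{\delta\log(p-k)}{n}}\Big) \ge \frac{C}{\sqrt{n}}\,(p-k)^{1-\delta/2}. $$

Since $n \le Lk^2\log(p-k)$ for some $L < +\infty$ by assumption, we have

$$ (p-k)\,q\Big(\sqrt{\frac{\delta\log(p-k)}{n}}\Big) \ge \frac{C}{\sqrt{L}}\,\frac{(p-k)^{1-\delta}}{k}\,\frac{(p-k)^{\delta/2}}{\sqrt{\log(p-k)}}, $$

which diverges to infinity, since $(p-k)^{1-\delta}/k$ is bounded away from zero under the assumption $k = O(p^{1-\delta})$, while $(p-k)^{\delta/2}/\sqrt{\log(p-k)} \to +\infty$.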

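Analysis of event $A_2$. In order to analyze this event, we first need to condition on the random vector $v := (v^1, \ldots, v^n)$, so as to decouple the random variables $\{D_\ell, \ell \in S\}$. After conditioning on $v$, each variate $nD_\ell$, $\ell \in S$, is a noncentral $\chi^2_{n,\nu^*}$, with $n$ degrees of freedom and noncentrality parameter $\nu^* = \frac{\beta}{k}\|v\|_2^2$, so that each $nD_\ell$ has mean $(\nu^* + n)$.

Since $v$ is a standard Gaussian $n$-vector, we have $\|v\|_2^2 \sim \chi^2_n$. Therefore, if we define the event $B(v) := \{\|v\|_2^2/n > 3/2\}$, the large deviations bound (60a) implies that $\mathbb{P}[B] \le \exp(-n/16)$. Therefore, by conditioning on $B$ and its complement, we obtain

$$ \begin{aligned} \mathbb{P}[A_2^c] &\le \mathbb{P}\Big[\min_{\ell \in S} D_\ell \ge 1 + t \,\Big|\, B^c\Big] + \mathbb{P}[B] \\ &\le \big(\mathbb{P}[\chi^2_{n,\nu^*} \ge n(1+t) \mid B^c]\big)^k + \exp(-n/16), \end{aligned} \tag{20} $$

where we have used the conditional independence of $\{D_\ell, \ell \in S\}$. Finally, since $\|v\|_2^2/n \le 3/2$ on the event $B^c$, we have $\nu^* \le \frac{3\beta}{2k}n$, and thus

$$ \mathbb{P}[\chi^2_{n,\nu^*} \ge n(1+t) \mid B^c] \le \mathbb{P}\Big[\chi^2_{n,\nu^*} \ge (n + \nu^*) + n\Big(t - \frac{3\beta}{2k}\Big) \,\Big|\, B^c\Big]. $$

Since $t = \sqrt{\delta\log(p-k)/n}$ and $n \le Lk^2\log(p-k)$, we have $t \ge \frac{1}{k}\sqrt{\delta/L}$, so that the quantity $\varepsilon := \min\{\frac{1}{2}, t - \frac{3\beta}{2k}\}$ is positive for the pre-factor $L > 0$ chosen sufficiently small. Thus, we have

$$ \mathbb{P}[\chi^2_{n,\nu^*} \ge n(1+t) \mid B^c] \le \mathbb{P}[\chi^2_{n,\nu^*} \ge (n + \nu^*) + n\varepsilon] \le \exp\Big(-\frac{n\varepsilon^2}{16(1 + 2(3/2))}\Big) = \exp\Big(-\frac{n\varepsilon^2}{64}\Big), $$

using the $\chi^2$ tail bound (63). Substituting this upper bound into (20), we obtain

$$ \mathbb{P}[A_2^c] \le \exp\Big(-\frac{kn\varepsilon^2}{64}\Big) + \exp(-n/16), $$

which certainly vanishes if $\varepsilon = \frac{1}{2}$. Otherwise, we have $\varepsilon = t - \frac{3\beta}{2k}$ with $t = \sqrt{\delta\log(p-k)/n}$, and we need the quantity

$$ kn\varepsilon^2 = k\Big(\sqrt{\delta\log(p-k)} - \frac{3\beta}{2k}\sqrt{n}\Big)^2 $$

to diverge to $+\infty$. This divergence is guaranteed by choosing $n \le Lk^2\log(p-k)$ for $L$ sufficiently small.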

4. Proof of Theorem 2(b). The proof of our main result is constructive 
in nature, based on the notion of a primal-dual certificate, that is, a pri- 
mal feasible solution and a dual feasible solution that together satisfy the 
optimality conditions associated with the SDP (13). 

4.1. High-level proof outline. We first provide a high-level outline of the main steps in our proof. Under the stated assumptions of Theorem 2, it suffices to construct a rank-one optimal solution $\hat{Z} = \hat{z}\hat{z}^T$, constructed from a vector with $\|\hat{z}\|_2 = 1$, with the following properties:

(21a) Correct sign: $\operatorname{sign}(\hat{z}_i) = \operatorname{sign}(z^*_i)$ for all $i \in S$, and

(21b) Correct exclusion: $\hat{z}_j = 0$ for all $j \in S^c$.

Note that our objective function $f(Z) = \operatorname{tr}(\hat{\Sigma}Z) - \rho_n\sum_{i,j}|Z_{ij}|$ is concave but not differentiable. However, it still possesses a subdifferential (see the books [17, 33] for more details), so that it may be shown that the following conditions are sufficient to verify the optimality of $\hat{Z} = \hat{z}\hat{z}^T$.

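Lemma 5. Suppose that, for each $x \in \mathbb{R}^p$ with $\|x\|_2 = 1$, there exists a sign matrix $\hat{U} = \hat{U}(x)$ such that:

(a) the matrix $\hat{U}$ satisfies

$$ \hat{U}_{ij} \begin{cases} = \operatorname{sign}(\hat{z}_i)\operatorname{sign}(\hat{z}_j), & \text{if } \hat{z}_i\hat{z}_j \ne 0, \\ \in [-1, +1], & \text{otherwise;} \end{cases} \tag{22} $$

(b) the vector $\hat{z}$ satisfies $x^T(\hat{\Sigma} - \rho_n\hat{U}(x))x \le \hat{z}^T(\hat{\Sigma} - \rho_n\hat{U}(x))\hat{z}$.

Then $\hat{Z} = \hat{z}\hat{z}^T$ is an optimal rank-one solution.

Proof. The subdifferential $\partial f(\hat{Z})$ of our objective function at $Z = \hat{Z}$ consists of matrices of the form $\hat{\Sigma} - \rho_n\hat{U}$, where $\hat{U}$ satisfies the condition (22). By the concavity of $f$, for any such $\hat{U}$ and for all $x \in \mathbb{R}^p$ with $\|x\|_2 = 1$, we have

$$ f(xx^T) \le f(\hat{Z}) + \operatorname{tr}\big((\hat{\Sigma} - \rho_n\hat{U})(xx^T - \hat{Z})\big). $$

Therefore, it suffices to demonstrate, for each $x \in \mathbb{R}^p$ with $\|x\|_2 = 1$, a valid sign matrix $\hat{U}(x)$ such that $\operatorname{tr}((\hat{\Sigma} - \rho_n\hat{U}(x))(xx^T - \hat{Z})) \le 0$. Since we have

$$ \operatorname{tr}\big((\hat{\Sigma} - \rho_n\hat{U}(x))xx^T\big) \le \operatorname{tr}\big((\hat{\Sigma} - \rho_n\hat{U}(x))\hat{Z}\big) $$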
by assumption (b), the stated conditions are sufficient. □

Remarks. Note that if there is a $\hat{U}$ independent of $x$ such that $\hat{z}$ satisfies condition (b) of Lemma 5, that is, if $\hat{z}$ is a maximal eigenvector of $\hat{\Sigma} - \rho_n\hat{U}$, then the above argument shows that $\hat{z}\hat{z}^T$ is in fact "the" optimal solution (i.e., among all matrices in the constraint space, not necessarily rank one).

The condition (22), when combined with the condition (21a), implies that we must have

$$ \hat{U}_{SS} = \operatorname{sign}(z^*_S)\operatorname{sign}(z^*_S)^T. \tag{23} $$

The remainder of the proof consists in choosing appropriately the remaining dual blocks $\hat{U}_{SS^c}$ and $\hat{U}_{S^cS^c}$, and verifying that the primal-dual optimality conditions are satisfied. To describe the remaining steps, it is convenient to define the matrix

$$ \Phi := \hat{\Sigma} - \rho_n\hat{U} - \Gamma = \beta z^*z^{*T} - \rho_n\hat{U} + \Delta, \tag{24} $$

where $\Delta := \hat{\Sigma} - \Sigma$ is the effective noise in the sample covariance matrix. We divide our proof into three main steps, based on the block structure

$$ \Phi = \begin{bmatrix} \beta z^*_S z^{*T}_S - \rho_n\hat{U}_{SS} + \Delta_{SS} & -\rho_n\hat{U}_{SS^c} + \Delta_{SS^c} \\ -\rho_n\hat{U}_{S^cS} + \Delta_{S^cS} & -\rho_n\hat{U}_{S^cS^c} + \Delta_{S^cS^c} \end{bmatrix}. \tag{25} $$

(A) In step A, we analyze the upper-left block $\Phi_{SS}$, using the fixed choice $\hat{U}_{SS} = \operatorname{sign}(z^*_S)\operatorname{sign}(z^*_S)^T$. We establish conditions on the regularization parameter $\rho_n$ and the noise matrix $\Delta_{SS}$ under which the maximal eigenvector of $\Phi_{SS}$ has the same sign pattern as $z^*_S$. This maximal eigenvector specifies the $k$-dimensional subvector $\hat{z}_S$ of our optimal primal solution.

(B) In step B, we analyze the off-diagonal block $\Phi_{S^cS}$, in particular establishing conditions on the noise matrix $\Delta_{S^cS}$ under which a valid sign matrix $\hat{U}_{S^cS}$ can be chosen such that the $p$-vector $\hat{z} := (\hat{z}_S, 0_{S^c})$ is an eigenvector of the full matrix $\Phi$.

(C) In step C, we focus on the lower-right block $\Phi_{S^cS^c}$, in particular analyzing conditions on $\Delta_{S^cS^c}$ such that a valid sign matrix $\hat{U}_{S^cS^c}$ can be chosen such that $\hat{z}$ defined in step B satisfies condition (b) of Lemma 5.

Our primary interest in this paper is the effective noise matrix $\Delta = \hat{\Sigma} - \Sigma$ induced by the usual i.i.d. sampling model. However, our results are actually somewhat more general, in that we can provide conditions on arbitrary noise matrices (which need not be of the Wishart type) under which it is possible to construct $(\hat{z}, \hat{U})$ as in steps A through C. Accordingly, in order to make the proof as clear as possible, we divide our analysis into two parts: in Section 4.2, we specify sufficient properties on arbitrary noise matrices $\Delta$, and in Section 4.3, we analyze the Wishart ensemble induced by the i.i.d. sampling model and establish sufficient conditions on the sample size $n$. In Section 4.3, we focus exclusively on the special case of the spiked identity covariance, whereas Section 4.4 describes how our results extend to the more general spiked covariance ensembles covered by Theorem 2.



4.2. Sufficient conditions for general noise matrices. We now state a 
series of sufficient conditions, applicable to general noise matrices. So as to 
clarify the flow of the main proof, we defer the proofs of these technical 
lemmas to Appendix D. 



4.2.1. Sufficient conditions for step A. We begin with sufficient conditions for the block $(S,S)$. In particular, with the choice (23) of $\hat{U}_{SS}$ and noting that $\operatorname{sign}(z^*_S) = \sqrt{k}\,z^*_S$ by assumption, we have

$$ \Phi_{SS} = (\beta - \rho_n k)\,z^*_S z^{*T}_S + \Delta_{SS} =: \alpha\,z^*_S z^{*T}_S + \Delta_{SS}, $$

where the quantity $\alpha := \beta - \rho_n k < \beta$ represents a "post-regularization" signal-to-noise ratio. Throughout the remainder of the development, we enforce the constraint

$$ \rho_n = \frac{\beta}{2k}, \tag{26} $$

so that $\alpha = \beta/2$. The following lemma guarantees correct sign recovery [see (21a)], assuming that $\Delta_{SS}$ is "small" in a suitable sense.


Lemma 6 (Correct sign recovery). Suppose that the upper-left noise matrix $\Delta_{SS}$ satisfies

Oi 

(27) I A Sslll 00,00 

with probability 1 as $p \to +\infty$. Then w.a.p. one, the following occurs:

(a) The maximal eigenvalue $\gamma_1 := \lambda_{\max}(\Phi_{SS})$ converges to $\alpha$, and its second largest eigenvalue $\gamma_2$ converges to zero.

(b) The upper-left block $\Phi_{SS}$ has a unique maximal eigenvector $\hat{z}_S$ with the correct sign property [i.e., $\operatorname{sign}(\hat{z}_S) = \operatorname{sign}(z^*_S)$]. More specifically, we have



(28)

4.2.2. Sufficient conditions for step B. With the subvector $\hat{z}_S$ specified, we can now specify the $(p-k) \times k$ submatrix $\hat{U}_{S^cS}$ so that the vector

$$ \hat{z} := (\hat{z}_S, 0_{S^c}) \in \mathbb{R}^p \tag{29} $$

is an eigenvector of the full matrix $\Phi$. In particular, if we define the renormalized quantity $\tilde{z}_S := \hat{z}_S/\|\hat{z}_S\|_1$, and choose

$$ \hat{U}_{S^cS} = \frac{1}{\rho_n}(\Delta_{S^cS}\,\tilde{z}_S)\operatorname{sign}(\hat{z}_S)^T, \tag{30} $$

then some straightforward algebra shows that $(\Delta_{S^cS} - \rho_n\hat{U}_{S^cS})\hat{z}_S = 0$, so that $\hat{z}$ is an eigenvector of the matrix $\Phi = \beta z^*(z^*)^T - \rho_n\hat{U} + \Delta$. It remains to verify that the choice (30) is a valid sign matrix (meaning that its entries are bounded in absolute value by one).
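The "straightforward algebra" behind the choice (30) rests on the identity $\operatorname{sign}(\hat{z}_S)^T\hat{z}_S = \|\hat{z}_S\|_1$. A minimal numerical sketch follows; the noise block, the vector $\hat{z}_S$, and the function name are made-up illustrative inputs (assumptions), chosen only so that the construction can be checked entrywise.

```python
import random

def build_offdiag_sign_block(delta_ScS, z_hat_S, rho):
    """Form U = (1/rho) * (Delta z_tilde) sign(z_hat)^T, with z_tilde = z_hat / ||z_hat||_1."""
    l1 = sum(abs(z) for z in z_hat_S)
    z_tilde = [z / l1 for z in z_hat_S]
    signs = [1.0 if z > 0 else -1.0 for z in z_hat_S]
    col = [sum(row[j] * z_tilde[j] for j in range(len(z_tilde))) / rho for row in delta_ScS]
    return [[c * s for s in signs] for c in col]

rng = random.Random(1)
k, m, beta = 4, 6, 2.0
rho = beta / (2 * k)                       # the constraint (26)
z_hat_S = [0.5, -0.5, 0.5, 0.5]            # hypothetical unit vector with no zero entries
delta_ScS = [[0.05 * rng.gauss(0, 1) for _ in range(k)] for _ in range(m)]  # small noise block

U = build_offdiag_sign_block(delta_ScS, z_hat_S, rho)
# Residual of (Delta_{S^c S} - rho * U) z_hat_S, which should vanish.
residual = [sum((delta_ScS[i][j] - rho * U[i][j]) * z_hat_S[j] for j in range(k))
            for i in range(m)]
max_residual = max(abs(r) for r in residual)
max_entry = max(abs(u) for row in U for u in row)   # validity requires this to be <= 1
print(max_residual, max_entry)
```

The residual vanishes for any noise block, whereas the bound on the entries holds only when the noise is small relative to $\rho_n$, which is what the next lemma quantifies.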

Lemma 7. Suppose that w.a.p. one, the matrix $\Delta$ satisfies conditions (27), and in addition, for sufficiently small $\delta > 0$, we have

$$ |||\Delta_{S^cS}|||_{\infty,2} \le \frac{\delta}{\sqrt{k}}. \tag{31} $$

Then the specified $\hat{U}_{S^cS}$ is a valid sign matrix w.a.p. one.

4.2.3. Sufficient conditions in step C. Up to this point, we have established that $\hat{z} := (\hat{z}_S, 0_{S^c})$ is an eigenvector of $\hat{\Sigma} - \rho_n\hat{U}$. Thus far, we have specified the sub-blocks $\hat{U}_{SS}$ and $\hat{U}_{SS^c}$ of the sign matrix. To complete the proof, it suffices to show that condition (b) in Lemma 5 can be satisfied, namely, that for each $x \in S^{p-1}$, there exists an extension $\hat{U}_{S^cS^c}(x)$ to our sign matrix such that

$$ \hat{z}^T(\hat{\Sigma} - \rho_n\hat{U}(x))\hat{z} \ge x^T(\hat{\Sigma} - \rho_n\hat{U}(x))x. $$

Note that it is sufficient to establish the above inequality with $\Phi(x)$ in place of $\hat{\Sigma} - \rho_n\hat{U}(x)$.¹ Given any vector $x \in S^{p-1}$, recall the definition (24) of the matrix $\Phi = \Phi(x)$ and observe that $\hat{z}^T\Phi(x)\hat{z} = \gamma_1$ for any choice of $\hat{U}_{S^cS^c}(x)$. Consider the partition $x = (u, v) \in S^{p-1}$ with $u \in \mathbb{R}^k$ and $v \in \mathbb{R}^m$, where $m = p - k$. We have

$$ x^T\Phi x = u^T\Phi_{SS}u + 2v^T\Phi_{S^cS}u + v^T\Phi_{S^cS^c}v. \tag{32} $$

Let us decompose $u = \mu\hat{z}_S + z_S^\perp$, where $|\mu| \le 1$ and $z_S^\perp$ is an element of the orthogonal complement of the span of $\hat{z}_S$. With this decomposition, we have

$$ \begin{aligned} u^T\Phi_{SS}u &= \mu^2\hat{z}_S^T\Phi_{SS}\hat{z}_S + 2\mu(z_S^\perp)^T\Phi_{SS}\hat{z}_S + (z_S^\perp)^T\Phi_{SS}z_S^\perp \\ &= \mu^2\gamma_1 + (z_S^\perp)^T\Phi_{SS}z_S^\perp, \end{aligned} $$

using the fact that $\hat{z}_S$ is an eigenvector of $\Phi_{SS}$ with eigenvalue $\gamma_1$ by definition. Note that $\|z_S^\perp\|_2^2 \le 1 - \mu^2$, so that $(z_S^\perp)^T\Phi_{SS}z_S^\perp$ is bounded by $(1 - \mu^2)\gamma_2$, where $\gamma_2$ is the second largest eigenvalue of $\Phi_{SS}$, which tends to zero according to Lemma 6. We thus conclude that

$$ u^T\Phi_{SS}u \le \mu^2\gamma_1 + (1 - \mu^2)\gamma_2. \tag{33} $$

¹ In particular, we have $x^T\Gamma x \le |||\Gamma|||_{2,2}\|x\|_2^2 = \max\{1, |||\Gamma_{p-k}|||_{2,2}\}\|x\|_2^2 = 1$, while $\hat{z}^T\Gamma\hat{z} = \|\hat{z}_S\|_2^2 = 1$; that is, we have $x^T\Gamma x \le \hat{z}^T\Gamma\hat{z}$.

The following lemma addresses the remaining two terms in the decompo- 
sition (32). 

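Lemma 8. Let $m = p - k$ and let $\mathbb{S} = \{(\eta_i, \xi_i)\}_i$ be a set of cardinality $|\mathbb{S}| = O(m)$. Suppose that in addition to conditions (27) and (31), the noise matrix $\Delta$ satisfies, w.p. 1,

$$ \max_{\|v\|_2 \le \eta,\ \|v\|_1 \le \xi} \sqrt{v^T(\Delta_{S^cS^c} + I_m)v} \le \eta + \frac{\delta}{\sqrt{k}}\,\xi + \varepsilon \qquad \forall(\eta,\xi) \in \mathbb{S}, \tag{34} $$

for sufficiently small $\delta, \varepsilon > 0$ as $m \to +\infty$. Then w.p. 1, for all $x \in S^{p-1}$, there exists a valid sign matrix $\hat{U}_{S^cS^c}(x)$ such that the matrix $\Phi(x) := \beta z^*z^{*T} - \rho_n\hat{U}(x) + \Delta$ satisfies

$$ x^T\Phi(x)x \le \mu^2\alpha + (1 - \mu^2)\frac{\alpha}{2} \le \alpha, \tag{35} $$

where $|\mu| = |x^T\hat{z}| \le 1$.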

4.3. Noise in a sample covariance. Having established general sufficient conditions on the effective noise matrix, we now turn to the case of i.i.d. samples $x^1, \ldots, x^n$ from the population covariance, and let the effective noise matrix correspond to the difference between the sample and population covariances. Our interest is in providing specific scalings of the triplet $(n, p, k)$ that ensure that the constructions in steps A through C can be carried out. So as to clarify the steps involved, we begin with the proof for the spiked identity ensemble ($\Gamma = I$). In Section 4.4, we provide the extension to nonidentity spiked ensembles.

Recalling our sampling model $x^i = \sqrt{\beta}\,v^i z^* + g^i$, define the vector $h = \frac{1}{n}\sum_{i=1}^n v^i g^i$. The effective noise matrix $\Delta = \hat{\Sigma} - \Sigma$ can be decomposed as follows:

$$ \Delta = \beta\underbrace{\Big(\frac{1}{n}\sum_{i=1}^n (v^i)^2 - 1\Big)z^*z^{*T}}_{P} + \sqrt{\beta}\underbrace{(z^*h^T + hz^{*T})}_{R} + \underbrace{\Big(\frac{1}{n}\sum_{i=1}^n g^i(g^i)^T - I_p\Big)}_{W}. \tag{36} $$

We have named each of the three terms that appear in (36), so that we can deal with each one separately in our analysis. The decomposition can be summarized as

$$ \Delta = \beta P + \sqrt{\beta}R + W. $$
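The decomposition $\Delta = \beta P + \sqrt{\beta}R + W$ is an exact algebraic identity for every realization of the samples, which can be confirmed numerically. The sketch below assumes the spiked identity model with positive entries $1/\sqrt{k}$ on the support and uses small, arbitrary dimensions; all variable names are illustrative.

```python
import random

def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[ai * bj for bj in b] for ai in a]

rng = random.Random(2)
n, p, k, beta = 8, 5, 2, 1.5
z = [1 / k ** 0.5] * k + [0.0] * (p - k)            # assumed sign pattern: all positive
v = [rng.gauss(0, 1) for _ in range(n)]
g = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
x = [[beta ** 0.5 * v[i] * z[j] + g[i][j] for j in range(p)] for i in range(n)]

eye = lambda a, b: 1.0 if a == b else 0.0
# Sample covariance and population covariance (spiked identity).
sigma_hat = [[sum(x[i][a] * x[i][b] for i in range(n)) / n for b in range(p)] for a in range(p)]
sigma = [[beta * z[a] * z[b] + eye(a, b) for b in range(p)] for a in range(p)]

# The three named terms of the decomposition.
h = [sum(v[i] * g[i][j] for i in range(n)) / n for j in range(p)]
scale = sum(vi ** 2 for vi in v) / n - 1.0
zz, zh, hz = outer(z, z), outer(z, h), outer(h, z)
P = [[scale * zz[a][b] for b in range(p)] for a in range(p)]
R = [[zh[a][b] + hz[a][b] for b in range(p)] for a in range(p)]
W = [[sum(g[i][a] * g[i][b] for i in range(n)) / n - eye(a, b) for b in range(p)]
     for a in range(p)]

max_gap = max(abs(sigma_hat[a][b] - sigma[a][b]
                  - (beta * P[a][b] + beta ** 0.5 * R[a][b] + W[a][b]))
              for a in range(p) for b in range(p))
print(max_gap)
```

The gap is zero up to floating-point rounding, independently of $n$; only the sizes of the individual terms depend on the sample size.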

The last term $W$ is a centered Wishart random matrix, whereas the other two are cross terms from the sampling model, involving both random vectors and the unknown eigenvector $z^*$. Defining the standard Gaussian random matrix $G = (g^i_j)_{i,j=1}^{n,p} \in \mathbb{R}^{n \times p}$, we can express $W$ concisely as

$$ W = \frac{1}{n}G^TG - I_p. \tag{37} $$

Our strategy is to examine each of the terms $\beta P$, $\sqrt{\beta}R$ and $W$ separately. For sub-block $\Delta_{SS}$, the corresponding sub-blocks of all the three terms are present, while for sub-block $\Delta_{S^cS}$, only $\sqrt{\beta}R_{S^cS}$ and $W_{S^cS}$ have contributions. Since the conditions to be satisfied by these two sub-blocks are expressed in terms of their (operator) norms, the triangle inequality immediately yields the results for the whole sub-block, once we have established them separately for each of the contributing terms. On the other hand, although the conditions on $\Delta_{S^cS^c}$ (given in Lemma 8) do not have this (sub)additive property, only the Wishart term contributes to this sub-block, and it has a natural decomposition of the form required.

Regarding the Wishart term, the spectral norm (|||W|||2,2) of such a ran- 
dom matrix is well characterized [10, 13]; for instance, see claim (38a) in 
Lemma 10 for one precise statement. The following lemma, concerning the 
mixed (oo,2) norms of submatrices of centered Wishart matrices, is perhaps 
of independent interest, and plays a key role in our analysis. 

Lemma 9. Let $W \in \mathbb{R}^{p \times p}$ be a centered Wishart matrix as defined in (37). Let $I, J \subset \{1, \ldots, p\}$ be sets of indices, with cardinalities $|I|, |J| \to \infty$ as $n, p \to \infty$, and let $W_{I,J}$ denote the corresponding submatrix. Then, as long as $\max\{|J|, \log|I|\}/n = o(1)$, we have

$$ |||W_{I,J}|||_{\infty,2} = O\Big(\frac{\sqrt{|J|} + \sqrt{\log|I|}}{\sqrt{n}}\Big) $$

as $n, p \to +\infty$ with probability 1.

See Appendix E for the proof of this claim. 



4.3.1. Verifying steps A and B. First, let us look at the Wishart random matrix. The conditions on the upper-left sub-block $W_{SS}$ and lower-left sub-block $W_{S^cS}$ are addressed in the following lemma.

Lemma 10. As $(n, p, k) \to +\infty$, we have w.a.p. one

$$ |||W_{SS}|||_{2,2} = O\Big(\sqrt{\frac{k}{n}}\Big), \tag{38a} $$

$$ |||W_{SS}|||_{\infty,\infty} = O\Big(\frac{k}{\sqrt{n}}\Big), \tag{38b} $$

$$ |||W_{S^cS}|||_{\infty,2} = O\Big(\frac{\sqrt{k} + \sqrt{\log(p-k)}}{\sqrt{n}}\Big). \tag{38c} $$

In particular, under the scaling $n \ge Lk\log(p-k)$ and $k = O(\log p)$, the conditions of Lemmas 6 and 7 are satisfied for $W_{SS}$ and $W_{S^cS}$ for sufficiently large $L$.

Proof. Assertion (38a) about the spectral norm of $W_{SS}$ follows directly from known results on singular values of Gaussian random matrices (e.g., see [10, 13]). To bound the mixed norm $|||W_{S^cS}|||_{\infty,2}$, we apply Lemma 9 with the choices $I = S^c$ and $J = S$, noting that $|I| = p - k$ and $|J| = k$. Finally, to obtain a bound on $|||W_{SS}|||_{\infty,\infty}$, we first bound $|||W_{SS}|||_{\infty,2}$. Again using Lemma 9, this time with the choices $I = J = S$, we obtain

$$ |||W_{SS}|||_{\infty,2} = O\Big(\sqrt{\frac{k}{n}}\Big) \tag{39} $$

as $n, k \to \infty$. Now, using the fact that for any $x \in \mathbb{R}^k$, $\|x\|_2 \le \sqrt{k}\|x\|_\infty$, we obtain

$$ |||W_{SS}|||_{\infty,\infty} = \max_{\|x\|_\infty \le 1}\|W_{SS}x\|_\infty \le \max_{\|x\|_2 \le \sqrt{k}}\|W_{SS}x\|_\infty = \sqrt{k}\,|||W_{SS}|||_{\infty,2}. $$

Combined with the inequality (39), we obtain the stated claim (38b). □
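The norm comparison used in the last step of this proof holds for any $k$-column matrix, and can be checked directly using the standard characterizations of the two norms: $|||A|||_{\infty,2}$ is the largest Euclidean norm of a row and $|||A|||_{\infty,\infty}$ is the largest absolute row sum. The sketch below applies them to a small centered Wishart block; dimensions and helper names are illustrative assumptions.

```python
import random

def norm_inf_2(A):
    """|||A|||_{inf,2} = max over ||x||_2 = 1 of ||Ax||_inf: largest row Euclidean norm."""
    return max(sum(a * a for a in row) ** 0.5 for row in A)

def norm_inf_inf(A):
    """|||A|||_{inf,inf} = max over ||x||_inf <= 1 of ||Ax||_inf: largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

rng = random.Random(4)
n, k = 200, 6
G = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
# Centered Wishart block W = G^T G / n - I_k.
W = [[sum(G[i][a] * G[i][b] for i in range(n)) / n - (1.0 if a == b else 0.0)
      for b in range(k)] for a in range(k)]

lhs = norm_inf_inf(W)
rhs = k ** 0.5 * norm_inf_2(W)
print(lhs, rhs)
```

The inequality `lhs <= rhs` is the row-wise Cauchy-Schwarz bound $\|w\|_1 \le \sqrt{k}\|w\|_2$; it is deterministic, whereas the rates in (38a) through (38c) are probabilistic statements.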

We now turn to the cross-term R, and establish the following result. 

Lemma 11. The matrix $R = z^*h^T + hz^{*T}$, as defined in (36), satisfies the conditions of Lemmas 6 and 7.

Proof. First observe that $h$ may be viewed as a vector consisting of the off-diagonal elements of the first column of a $(p+1) \times (p+1)$ Wishart matrix, say $W'$. This representation follows since $h_j = \frac{1}{n}\sum_{i=1}^n v^i g^i_j$, where the Gaussian variable $v^i$ is independent of $g^i_j$ for all $1 \le j \le p$. For ease of reference, let us index rows and columns of $W'$ by $1', 1, \ldots, p$, let $S' = \{1'\} \cup S$, and let $h = W'_{1',\,S \cup S^c}$. (Recall that $S \cup S^c$ is simply $\{1, \ldots, p\}$.) Since the spectral norm of a matrix is an upper bound on the $\ell_2$-norm of any column, we have

$$ \|h_S\|_2 \le |||W'_{S',S'}|||_{2,2} = O\Big(\sqrt{\frac{k+1}{n}}\Big), \tag{40} $$

where we used known bounds [10] on singular values of Gaussian random matrices. Under the scaling $n \ge Lk\log(p-k)$, we thus have $\|h_S\|_2 \to 0$. By Lemma 15, we have $\mathbb{P}[|W'_{ij}| > t] \le C\exp(-cnt^2)$ for $t > 0$ sufficiently small, which implies (via union bound) that

$$ \|h\|_\infty = O\Big(\sqrt{\frac{\log p}{n}}\Big) = O(k^{-1/2}) \tag{41} $$

under our assumed scaling. Note also that $\|h\|_\infty = \max\{\|h_S\|_\infty, \|h_{S^c}\|_\infty\}$; that is, the $\infty$-norm of each of these subvectors is also $O(k^{-1/2})$. Assume for the following that $L$ is chosen large enough so that $\|h\|_\infty \le \delta/\sqrt{k}$.

Now, to complete the proof, let us first examine the spectral norm of $R_{SS} = z^*_S h_S^T + h_S z^{*T}_S$. The two (possibly) nonzero eigenvalues of this matrix are $z^{*T}_S h_S \pm \|z^*_S\|_2\|h_S\|_2$, whence we have

$$ |||R_{SS}|||_{2,2} \le |z^{*T}_S h_S| + \|z^*_S\|_2\|h_S\|_2 \le 2\|h_S\|_2 \to 0. $$

As for the (matrix) $\infty$-norm of $R_{SS}$, let us exploit the "maximum row sum" interpretation, that is, $|||R_{SS}|||_{\infty,\infty} = \max_{i \in S}\sum_{j \in S}|R_{ij}|$ (cf. Appendix A), to deduce

$$ \begin{aligned} |||R_{SS}|||_{\infty,\infty} &\le |||z^*_S h_S^T|||_{\infty,\infty} + |||h_S z^{*T}_S|||_{\infty,\infty} \\ &= \Big(\max_{i \in S}|z^*_i|\Big)\|h_S\|_1 + \Big(\max_{i \in S}|h_i|\Big)\|z^*_S\|_1 \\ &\le \frac{1}{\sqrt{k}}|||W'_{S',S'}|||_{\infty,\infty} + \|h_S\|_\infty\sqrt{k}. \end{aligned} $$

From the argument of Lemma 10, we have $|||W'_{S',S'}|||_{\infty,\infty} = O(k/\sqrt{n})$, so that the first term is $O(\sqrt{k/n}) \to 0$, and moreover, the norm $|||R_{SS}|||_{\infty,\infty}$ can be made smaller than $2\delta$, by choosing $L$ sufficiently large in the relation $n \ge Lk\log(p-k)$.

Finally, to establish the additional condition required by Lemma 7, namely (31), notice that

$$ \begin{aligned} |||R_{S^cS}|||_{\infty,2} &= \max_{\|y\|_2 = 1}\|R_{S^cS}\,y\|_\infty \\ &= \max_{\|y\|_2 = 1}\|h_{S^c}z^{*T}_S y\|_\infty \\ &= \Big(\max_{\|y\|_2 = 1}|z^{*T}_S y|\Big)\|h_{S^c}\|_\infty \le \frac{\delta}{\sqrt{k}}, \end{aligned} $$

where the last line uses $\max_{\|y\|_2 = 1}|z^{*T}_S y| = \|z^*_S\|_2 = 1$, thereby completing the proof. □

Finally, we examine the first term in (36), that is, $P$. As this term only contributes to the upper-left block, we only need to establish that it satisfies Lemma 6.

Lemma 12. The matrix $P_{SS}$ satisfies condition (27) of Lemma 6.

Proof. Note that for any matrix norm, we have $|||P_{SS}||| = |n^{-1}\sum_{i=1}^n (v^i)^2 - 1|\,|||z^*_S z^{*T}_S|||$. Now, notice that $|||z^*_S z^{*T}_S|||_{2,2} = |z^{*T}_S z^*_S| = 1$. Also, using the "maximum row sum" characterization of the matrix $\infty$-norm, we have $|||z^*_S z^{*T}_S|||_{\infty,\infty} = \sum_{j=1}^k |(\pm 1/\sqrt{k})(\pm 1/\sqrt{k})| = 1$. Now by the strong law of large numbers, $|n^{-1}\sum_{i=1}^n (v^i)^2 - 1| \to 0$ as $n \to \infty$. It follows that with probability 1

$$ |||P_{SS}|||_{2,2} = |||P_{SS}|||_{\infty,\infty} \to 0, $$

which clearly implies condition (27). □

4.3.2. Verifying step C. For this step, we only need to consider the lower-right block of $W$; that is, we only need to verify condition (34) of Lemma 8 for $\Delta_{S^cS^c} = W_{S^cS^c}$. Recall that $W = n^{-1}G^TG - I_p$, where $G$ is an $n \times p$ (canonical) Gaussian matrix [see (37)]. With a slight abuse of notation, let $G_{S^c} = (G_{ij})$ for $1 \le i \le n$ and $j \in S^c$. Note that $G_{S^c} \in \mathbb{R}^{n \times m}$ where $m = p - k$ and

$$ \Delta_{S^cS^c} + I_m = W_{S^cS^c} + I_m = n^{-1}G_{S^c}^TG_{S^c}. $$

Now, we can simplify the quadratic form in (34) as

$$ \sqrt{v^T(\Delta_{S^cS^c} + I_m)v} = \sqrt{\|n^{-1/2}G_{S^c}v\|_2^2} = \|n^{-1/2}G_{S^c}v\|_2, $$

for which we have the following lemma.

Lemma 13. For any $M > 0$ and $\varepsilon > 0$, there exists a constant $B > 0$ such that for any set $\mathbb{S} = \{(\eta_i, \xi_i)\}_i$ with elements in $(0, M) \times \mathbb{R}_+$ and cardinality $|\mathbb{S}| = O(m)$, we have

$$ \max_{\|v\|_2 \le \eta,\ \|v\|_1 \le \xi}\|n^{-1/2}G_{S^c}v\|_2 \le \eta + B\sqrt{\frac{\log m}{n}}\,\xi + \varepsilon \qquad \forall(\eta,\xi) \in \mathbb{S}, \tag{42} $$

as $p \to \infty$, with probability 1. In particular, under the scaling $n \ge Lk\log m$, condition (34) of Lemma 8 is satisfied for $L$ large enough.

Proof. Without loss of generality, assume $M = 1$. We begin by controlling the expectation of the left-hand side, using an argument based on the Gordon-Slepian theorem [26], similar to that used for establishing bounds on spectral norms of random Gaussian matrices (e.g., [10]). First, we require some notation: for a zero-mean random variable $Z$, define its standard deviation $\sigma(Z) = (\mathbb{E}|Z|^2)^{1/2}$. For vectors $x, y$ of the same dimension, define the Euclidean inner product $\langle x, y\rangle = x^Ty$. For matrices $X, Y$ of the same dimension (although not necessarily symmetric), recall the Hilbert-Schmidt norm

$$ |||X|||_{HS} := \langle\langle X, X\rangle\rangle^{1/2} = \Big(\sum_{i,j}X_{ij}^2\Big)^{1/2}. $$

Given some (possibly uncountable) index set $\{t \in T\}$, let $(X_t)_{t \in T}$ and $(Y_t)_{t \in T}$ be a pair of centered Gaussian processes. One version of the Gordon-Slepian theorem (see [26]) asserts that if $\sigma(X_s - X_t) \le \sigma(Y_s - Y_t)$ for all $s, t \in T$, then we have

$$ \mathbb{E}\Big[\sup_{t \in T}X_t\Big] \le \mathbb{E}\Big[\sup_{t \in T}Y_t\Big]. \tag{43} $$

For simplicity in notation, define $\tilde{H} := G_{S^c} \in \mathbb{R}^{n \times m}$, $H := n^{-1/2}\tilde{H}$, and fix some $\eta, \xi > 0$. We wish to bound

$$ f(H; \eta, \xi) := \max_{\|v\|_2 \le \eta,\ \|v\|_1 \le \xi}\|Hv\|_2 = \max_{\|v\|_2 \le \eta,\ \|v\|_1 \le \xi,\ \|u\|_2 = 1}\langle Hv, u\rangle, $$

where $v \in \mathbb{R}^m$, $u \in \mathbb{R}^n$. Note that $\langle Hv, u\rangle = u^THv = \operatorname{tr}(Hvu^T) = \langle\langle H, uv^T\rangle\rangle$. Consider $\tilde{H}$ to be a (canonical) Gaussian vector in $\mathbb{R}^{mn}$, take

$$ T := \{t = (u, v) \in \mathbb{R}^n \times \mathbb{R}^m \mid \|v\|_2 \le \eta,\ \|v\|_1 \le \xi,\ \|u\|_2 = 1\} \tag{44} $$

and define $X_t = \langle\langle\tilde{H}, uv^T\rangle\rangle$ for $t \in T$. Observe that $(X_t)_{t \in T}$ is a (centered) canonical Gaussian process generated by $\tilde{H}$, and $f(\tilde{H}; \eta, \xi) = \max_{t \in T}X_t$. We compare this to the maximum of another Gaussian process $(Y_t)_{t \in T}$, defined as $Y_t = \langle(g, h), (u, v)\rangle$, where $g \in \mathbb{R}^n$ and $h \in \mathbb{R}^m$ are Gaussian vectors with $\mathbb{E}[gg^T] = \eta^2I_n$ and $\mathbb{E}[hh^T] = I_m$. Note that, for example,

$$ \sigma(\langle g, u\rangle) = (\mathbb{E}\langle g, u\rangle^2)^{1/2} = (u^T\mathbb{E}[gg^T]u)^{1/2} = \eta\|u\|_2, $$
in which the left-hand side is the norm of a process $(\langle g, u\rangle)_u$ expressed in terms of the norm of a vector (i.e., its index).

Let $t = (u, v) \in T$ and $t' = (u', v') \in T$. Assume, without loss of generality, that $\|v'\|_2 \le \|v\|_2$. Then, we have

$$ \begin{aligned} \sigma^2(X_t - X_{t'}) &= |||uv^T - u'v'^T|||_{HS}^2 \\ &= |||uv^T - u'v^T + u'v^T - u'v'^T|||_{HS}^2 \\ &= \|v\|_2^2\|u - u'\|_2^2 + \|u'\|_2^2\|v - v'\|_2^2 + 2(u^Tu' - \|u'\|_2^2)(\|v\|_2^2 - v^Tv') \\ &\le \eta^2\|u - u'\|_2^2 + \|v - v'\|_2^2 = \sigma^2(Y_t - Y_{t'}), \end{aligned} $$

where we have used the Cauchy-Schwarz inequality to deduce $|u^Tu'| \le 1 = \|u'\|_2^2$, and $|v^Tv'| \le \|v\|_2\|v'\|_2 \le \|v\|_2^2$. Thus, the Gordon-Slepian lemma is applicable, and we obtain

$$ \begin{aligned} \mathbb{E}f(\tilde{H}; \eta, \xi) &\le \mathbb{E}\max_{t \in T}Y_t \\ &= \mathbb{E}\max_{\|u\|_2 = 1}\langle g, u\rangle + \mathbb{E}\max_{\|v\|_2 \le \eta,\ \|v\|_1 \le \xi}\langle h, v\rangle \\ &\le \mathbb{E}\|g\|_2 + (\mathbb{E}\|h\|_\infty)\,\xi \\ &\le \sqrt{n}\,\eta + \big(\sqrt{3\log m}\big)\,\xi, \end{aligned} $$

where we have used $(\mathbb{E}\|g\|_2)^2 \le \mathbb{E}(\|g\|_2^2) = \mathbb{E}\operatorname{tr}(gg^T) = \operatorname{tr}\mathbb{E}(gg^T) = n\eta^2$; the bound used for $\mathbb{E}\|h\|_\infty$ follows from standard Gaussian tail bounds [26]. Noting that $H = n^{-1/2}\tilde{H}$, we obtain $\mathbb{E}f(H; \eta, \xi) \le \eta + \sqrt{\frac{3\log m}{n}}\,\xi$.
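The variance comparison that licenses the Gordon-Slepian step is a purely algebraic statement about rank-one matrices, so it can be spot-checked on random inputs. The sketch below draws one pair of index points $(u, v)$ and $(u', v')$ with $\|u\|_2 = \|u'\|_2 = 1$ and $\|v'\|_2 \le \|v\|_2 \le \eta$, then compares the Hilbert-Schmidt distance with its three-term expansion and with the upper bound; the dimensions, seed and helper names are illustrative assumptions.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled(vec, target_norm):
    """Rescale vec to have the given Euclidean norm."""
    norm = dot(vec, vec) ** 0.5
    return [x * target_norm / norm for x in vec]

def hs_sq(u, v, u2, v2):
    """Squared Hilbert-Schmidt norm of u v^T - u2 v2^T."""
    return sum((a * c - b * d) ** 2 for a, b in zip(u, u2) for c, d in zip(v, v2))

rng = random.Random(3)
n, m, eta = 5, 7, 0.8
u = scaled([rng.gauss(0, 1) for _ in range(n)], 1.0)
u2 = scaled([rng.gauss(0, 1) for _ in range(n)], 1.0)
v = scaled([rng.gauss(0, 1) for _ in range(m)], eta)          # ||v||_2 = eta
v2 = scaled([rng.gauss(0, 1) for _ in range(m)], 0.5 * eta)   # ||v'||_2 <= ||v||_2

du = [a - b for a, b in zip(u, u2)]
dv = [a - b for a, b in zip(v, v2)]
lhs = hs_sq(u, v, u2, v2)                                     # sigma^2(X_t - X_t')
expansion = (dot(v, v) * dot(du, du) + dot(u2, u2) * dot(dv, dv)
             + 2 * (dot(u, u2) - dot(u2, u2)) * (dot(v, v) - dot(v, v2)))
upper = eta ** 2 * dot(du, du) + dot(dv, dv)                  # sigma^2(Y_t - Y_t')
print(lhs, expansion, upper)
```

The cross term in the expansion is nonpositive because $u^Tu' \le 1$ and $v^Tv' \le \|v\|_2^2$, which is exactly where the ordering $\|v'\|_2 \le \|v\|_2$ is used.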

The final step is to argue that $f(H; \eta, \xi)$ is sufficiently close to its mean. For this, we will use concentration of Gaussian measure [25, 26] for Lipschitz functions in $\mathbb{R}^{mn}$. To see that $A \mapsto f(A; \eta, \xi)$ is in fact 1-Lipschitz, note that it satisfies the triangle inequality and it is bounded above by the spectral norm. Thus,

$$ |f(\tilde{H}; \eta, \xi) - f(F; \eta, \xi)| \le f(\tilde{H} - F; \eta, \xi) \le |||\tilde{H} - F|||_{2,2} \le |||\tilde{H} - F|||_{HS}, $$

where we have used the assumption $\eta \le 1$. Noting that $H = n^{-1/2}\tilde{H}$ and $f(H; \eta, \xi) = n^{-1/2}f(\tilde{H}; \eta, \xi)$, Gaussian concentration of measure for 1-Lipschitz functions [25] implies that

$$ \mathbb{P}[f(H; \eta, \xi) - \mathbb{E}[f(H; \eta, \xi)] > t] \le \exp(-nt^2/2). $$

Finally, we use the union bound to establish the result uniformly over $\mathbb{S}$. By assumption, there exists some $K > 0$ such that $|\mathbb{S}| \le Km$. Thus,

$$ \mathbb{P}\Big[\max_{(\eta,\xi) \in \mathbb{S}}\Big(f(H; \eta, \xi) - \Big(\eta + \sqrt{\frac{3\log m}{n}}\,\xi\Big)\Big) > t\Big] \le K\exp(-nt^2/2 + \log m). $$

Now, fix some $\varepsilon > 0$, take $t = \sqrt{\frac{6\log m}{n}}$ and apply the Borel-Cantelli lemma to conclude that

$$ \max_{(\eta,\xi) \in \mathbb{S}}\Big(f(H; \eta, \xi) - \Big(\eta + \sqrt{\frac{3\log m}{n}}\,\xi\Big)\Big) \le \sqrt{\frac{6\log m}{n}} \le \varepsilon $$

eventually (w.p. 1). □

4.4. Nonidentity noise covariance. In this section, we specify how the proof is extended to (population) covariance matrices having a more general base covariance term $\Gamma_{p-k}$ in (5). Let $\Gamma_{p-k}^{1/2}$ denote the (symmetric) square root of $\Gamma_{p-k}$. We can write samples from this model as

$$ x^i = \sqrt{\beta}\,v^i z^* + \tilde{g}^i, \qquad i = 1, \ldots, n, \tag{45} $$

where

$$ \tilde{g}^i = \begin{bmatrix} g^i_S \\ \Gamma_{p-k}^{1/2}\,g^i_{S^c} \end{bmatrix}, \tag{46} $$

with $g^i \sim N(0, I_p)$ and $v^i \sim N(0, 1)$ standard independent Gaussian random variables.

Denoting the resulting sample covariance as $\hat{\Sigma}$, we can obtain an expression for the noise matrix $\Delta = \hat{\Sigma} - \Sigma$. The result will be similar to expansion (36) with $h$ and $W$ appropriately modified; more specifically, we have

$$ \tilde{h}_S = h_S, \qquad \tilde{h}_{S^c} = \Gamma_{p-k}^{1/2}h_{S^c}, \tag{47} $$

$$ \tilde{W}_{SS} = W_{SS}, \qquad \tilde{W}_{S^cS} = \Gamma_{p-k}^{1/2}W_{S^cS}, \qquad \tilde{W}_{S^cS^c} = \Gamma_{p-k}^{1/2}W_{S^cS^c}\Gamma_{p-k}^{1/2}. \tag{48} $$

Note that the $P$-term is unaffected.

Re-examining the proof presented for the case $\Gamma_{p-k} = I_{p-k}$, we can identify conditions imposed on $h$ and $W$ to guarantee optimality. By imposing sufficient constraints on $\Gamma_{p-k}$, we can make $\tilde{h}$ and $\tilde{W}$ satisfy the same conditions. The rest of the proof will then be exactly the same as the case $\Gamma_{p-k} = I_{p-k}$. As before, we proceed by verifying steps A through C in sequence.

4.4.1. Verifying steps A and B. Examining the proof of Lemma 11, we observe that we need bounds on $\|\tilde{h}_S\|_2$, $\|\tilde{h}_S\|_1$ and $\|\tilde{h}\|_\infty = \max\{\|\tilde{h}_S\|_\infty, \|\tilde{h}_{S^c}\|_\infty\}$. Since $\tilde{h}_S = h_S$, we should only be concerned with $\|\tilde{h}_{S^c}\|_\infty$, for which we simply have

$$ \|\tilde{h}_{S^c}\|_\infty \le |||\Gamma_{p-k}^{1/2}|||_{\infty,\infty}\|h_{S^c}\|_\infty. $$

Thus, assumption (6a), that is, $|||\Gamma_{p-k}^{1/2}|||_{\infty,\infty} = O(1)$, guarantees that Lemma 11 also holds for (nonidentity) $\Gamma$.

Similarly, for Lemma 10 to hold, we need to investigate $|||\tilde{W}_{S^cS}|||_{\infty,2}$, since this is the only norm (among those considered in the lemma) affected by a nonidentity $\Gamma$. Using the sub-multiplicative property of operator norms [see relation (58) in Appendix A], we have

$$ |||\tilde{W}_{S^cS}|||_{\infty,2} \le |||\Gamma_{p-k}^{1/2}|||_{\infty,\infty}\,|||W_{S^cS}|||_{\infty,2}, $$

so that the same boundedness assumption (6a) is sufficient.



4.4.2. Verifying step C. For the lower-right block $\tilde{W}_{S^cS^c}$, we first have to verify Lemma 13. We also need to examine the proof of Lemma 8 where the result of Lemma 13, namely relation (42), was used. Let $\tilde{G} = (\tilde{g}^i_j)_{i,j=1}^{n,p}$ and let $\tilde{G}_{S^c} = (\tilde{G}_{ij})$ for $1 \le i \le n$ and $j \in S^c$. Note that $\tilde{G}_{S^c}^T \in \mathbb{R}^{(p-k) \times n}$ and we have

$$ \tilde{G}_{S^c}^T = (\tilde{g}^1_{S^c}, \ldots, \tilde{g}^n_{S^c}) = \Gamma_{p-k}^{1/2}(g^1_{S^c}, \ldots, g^n_{S^c}) = \Gamma_{p-k}^{1/2}G_{S^c}^T. $$

Using this notation, we can write $\tilde{W}_{S^cS^c} = n^{-1}\tilde{G}_{S^c}^T\tilde{G}_{S^c} - \Gamma_{p-k} = \Gamma_{p-k}^{1/2}(n^{-1}G_{S^c}^TG_{S^c} - I_{p-k})\Gamma_{p-k}^{1/2}$, consistent with (48).

Now to establish a version of (42), we have to consider the maximum of

$$ \|n^{-1/2}\tilde{G}_{S^c}v\|_2 = \|n^{-1/2}G_{S^c}\Gamma_{p-k}^{1/2}v\|_2 $$

over the set where $\|v\|_2 \le \eta$ and $\|v\|_1 \le \xi$. Let $\tilde{v} = \Gamma_{p-k}^{1/2}v$ and note that for any consistent pair of vector-matrix norms we have $\|\tilde{v}\| \le |||\Gamma_{p-k}^{1/2}|||\,\|v\|$. Thus, for example, $\|v\|_2 \le \eta$ implies $\|\tilde{v}\|_2 \le |||\Gamma_{p-k}^{1/2}|||_{2,2}\,\eta$, and similarly for the $\ell_1$-norm. Now, if we assume that Lemma 13 holds for $G_{S^c}$, we obtain, for all $(\eta, \xi) \in \mathbb{S}$, the inequality

$$ \begin{aligned} \max_{\|v\|_2 \le \eta,\ \|v\|_1 \le \xi}\|n^{-1/2}\tilde{G}_{S^c}v\|_2 &\le \max_{\|\tilde{v}\|_2 \le |||\Gamma_{p-k}^{1/2}|||_{2,2}\eta,\ \|\tilde{v}\|_1 \le |||\Gamma_{p-k}^{1/2}|||_{1,1}\xi}\|n^{-1/2}G_{S^c}\tilde{v}\|_2 \\ &\le |||\Gamma_{p-k}^{1/2}|||_{2,2}\,\eta + B\,|||\Gamma_{p-k}^{1/2}|||_{1,1}\sqrt{\frac{\log m}{n}}\,\xi + \varepsilon. \end{aligned} \tag{49} $$

Thus, one observes that the boundedness condition (6a) guarantees that

$$ |||\Gamma_{p-k}^{1/2}|||_{1,1} = |||\Gamma_{p-k}^{1/2}|||_{\infty,\infty} \le A_1, $$
thereby taking care of the second term in (49). More specifically, the constant $A_1$ is simply absorbed into some $B' = BA_1$. In addition, we also require a bound on $|||\Gamma_{p-k}^{1/2}|||_{2,2}$, which follows from our assumption $|||\Gamma_{p-k}|||_{2,2} \le 1$. However, the fact that the factor multiplying $\eta$ in (49) is no longer unity has to be addressed more carefully.

Recall that inequality (42) was used in the proof of Lemma 8 to establish a bound on

$$ v^{*T}\Delta_{S^cS^c}v^* = v^{*T}W_{S^cS^c}v^* = v^{*T}(H^TH - I_{p-k})v^* = \|Hv^*\|_2^2 - \|v^*\|_2^2, $$

where $H = n^{-1/2}G_{S^c}$. The bound obtained on this term is given by (76). We focus on the core idea, omitting some technical details such as the discretization argument.² Replacing $W_{S^cS^c}$ with $\tilde{W}_{S^cS^c}$, we need to establish a similar bound on

$$ v^{*T}\tilde{W}_{S^cS^c}v^* = v^{*T}(n^{-1}\tilde{G}_{S^c}^T\tilde{G}_{S^c} - \Gamma_{p-k})v^* = \|n^{-1/2}\tilde{G}_{S^c}v^*\|_2^2 - \|\Gamma_{p-k}^{1/2}v^*\|_2^2. $$

Note that $\|v^*\|_2 \le |||\Gamma_{p-k}^{-1/2}|||_{2,2}\|\Gamma_{p-k}^{1/2}v^*\|_2$ or, equivalently, $|||\Gamma_{p-k}^{-1/2}|||_{2,2}^{-1}\|v^*\|_2 \le \|\Gamma_{p-k}^{1/2}v^*\|_2$. Thus, using (49), one obtains

$$ \|n^{-1/2}\tilde{G}_{S^c}v^*\|_2^2 - \|\Gamma_{p-k}^{1/2}v^*\|_2^2 \le \big(|||\Gamma_{p-k}^{1/2}|||_{2,2}^2 - |||\Gamma_{p-k}^{-1/2}|||_{2,2}^{-2}\big)\|v^*\|_2^2 + (\text{terms of lower order in } \|v^*\|_2). $$

Note that unlike the case $\Gamma_{p-k} = I_{p-k}$, the term quadratic in $\|v^*\|_2$ does not vanish in general. Thus, we have to assume that its coefficient is eventually small compared to $\beta$. More specifically, we assume

$$ |||\Gamma_{p-k}^{1/2}|||_{2,2}^2 - |||\Gamma_{p-k}^{-1/2}|||_{2,2}^{-2} \le \frac{\alpha}{4} = \frac{\beta}{8}, \quad \text{eventually.} \tag{50} $$

The boundedness assumptions on $|||\Gamma_{p-k}^{1/2}|||_{1,1}$ and $|||\Gamma_{p-k}^{1/2}|||_{2,2}$ now allow the rest of the terms to be made less than $\alpha/4$, using arguments similar to the proof of Lemma 8, so that the overall objective is less than $\alpha/2$, eventually. This concludes the proof.

Noting that $|||\Gamma_{p-k}^{1/2}|||_{2,2}^2 = \lambda_{\max}(\Gamma_{p-k})$ and $|||\Gamma_{p-k}^{-1/2}|||_{2,2}^{-2} = \lambda_{\min}(\Gamma_{p-k})$, we can summarize the conditions sufficient for Lemma 8 to extend to general covariance structure as follows:

$$ |||\Gamma_{p-k}^{1/2}|||_{1,1} = |||\Gamma_{p-k}^{1/2}|||_{\infty,\infty} = O(1); \tag{51a} $$

$$ \lambda_{\max}(\Gamma_{p-k}) \le 1; \tag{51b} $$

$$ \lambda_{\max}(\Gamma_{p-k}) - \lambda_{\min}(\Gamma_{p-k}) \le \frac{\beta}{8}, \tag{51c} $$

as stated previously.

² In particular, we will assume that $v^*$ saturates (49), so that $\|v^*\|_2 = \eta$. For a more careful argument see the proof of Lemma 8.

5. Proof of Theorem 3. Our proof is based on the standard approach of applying Fano's inequality (e.g., [7, 16, 37, 38]). Let $\mathbb{S}$ denote the collection of all possible support sets, that is, the collection of $k$-subsets of $\{1, \ldots, p\}$ with cardinality $|\mathbb{S}| = \binom{p}{k}$; we view $S$ as a random variable distributed uniformly over $\mathbb{S}$. Let $\mathbb{P}_S$ denote the distribution of a sample $X \sim N(0, \Sigma_p(S))$ from a spiked covariance model, conditioned on the maximal eigenvector having support set $S$, and let $X^n = (x^1, \ldots, x^n)$ be a set of $n$ i.i.d. samples. In information-theoretic terms, we view any method of support recovery as a decoder that operates on the data $X^n$ and outputs an estimate of the support $\hat{S} = \phi(X^n)$; in short, a (possibly random) map $\phi : (\mathbb{R}^p)^n \to \mathbb{S}$. Using the 0-1 loss to compare an estimate $\hat{S}$ and the true support set $S$, the associated risk is simply the probability of error $\mathbb{P}[\text{error}] = \sum_{S \in \mathbb{S}}\binom{p}{k}^{-1}\mathbb{P}_S[\hat{S} \ne S]$. Due to symmetry of the ensemble, in fact we have $\mathbb{P}[\text{error}] = \mathbb{P}_S[\hat{S} \ne S]$, where $S$ is some fixed but arbitrary support set, a property that we refer to as risk flatness.

In order to generate suitably tight lower bounds, we restrict attention to the following sub-collection $\tilde{\mathbb{S}}$ of support sets:

$$ \tilde{\mathbb{S}} := \{S \in \mathbb{S} \mid \{1, \ldots, k-1\} \subset S\}, $$

consisting of those $k$-element subsets that contain $\{1, \ldots, k-1\}$ and one element from $\{k, \ldots, p\}$. By risk flatness, the probability of error with $S$ chosen uniformly at random from the original ensemble $\mathbb{S}$ is the same as the probability of error with $S$ chosen uniformly from $\tilde{\mathbb{S}}$. Letting $U$ denote a subset chosen uniformly at random from $\tilde{\mathbb{S}}$, using Fano's inequality, we have the lower bound

$$ \mathbb{P}[\text{error}] \ge 1 - \frac{I(U; X^n) + \log 2}{\log|\tilde{\mathbb{S}}|}, $$

where $I(U; X^n)$ is the mutual information between the data $X^n$ and the randomly chosen support set $U$, and $|\tilde{\mathbb{S}}| = p - k + 1$ is the cardinality of $\tilde{\mathbb{S}}$.

It remains to obtain an upper bound on $I(U; X^n) = H(X^n) - H(X^n \mid U)$.
By the chain rule for entropy, we have $H(X^n) \le nH(x)$. Next, using the
maximum entropy property of the Gaussian distribution [7], we have

(52) $H(X^n) \le nH(x) \le n\left\{\frac{p}{2}[1 + \log(2\pi)] + \frac{1}{2}\log\det E[xx^T]\right\}$,

where $E[xx^T]$ is the covariance matrix of $x$. On the other hand, given $U$,
the vector $X^n$ is a collection of $n$ Gaussian $p$-vectors with covariance matrix
$\Sigma_p(U)$. The determinant of this matrix is $1 + \beta$, independent of $U$, so that
we have

(53) $H(X^n \mid U) = \frac{np}{2}[1 + \log(2\pi)] + \frac{n}{2}\log(1 + \beta)$.

Combining (52) and (53), we obtain

(54) $I(U; X^n) \le \frac{n}{2}\{\log\det E[xx^T] - \log(1 + \beta)\}$.

The following lemma, proved in Appendix F, specifies the form of the log
determinant of the covariance matrix $\Sigma_M := E[xx^T]$.

Lemma 14. The log determinant has the exact expression

(55) $\log\det\Sigma_M = \log(1 + \beta) + \log\left(1 - \frac{\beta}{1+\beta}\,\frac{p-k}{k(p-k+1)}\right) + (p-k)\log\left(1 + \frac{\beta}{k(p-k+1)}\right)$.

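Substituting (55) into (54) and using the inequality $\log(1 + a) \le a$, we
obtain

$I(U; X^n) \le \frac{n}{2}\left\{\log\left(1 - \frac{\beta}{1+\beta}\,\frac{p-k}{k(p-k+1)}\right) + (p-k)\log\left(1 + \frac{\beta}{k(p-k+1)}\right)\right\}$

$\le \frac{n}{2}\left\{-\frac{\beta}{1+\beta}\,\frac{p-k}{k(p-k+1)} + \frac{\beta(p-k)}{k(p-k+1)}\right\}$

$= \frac{n}{2}\,\frac{\beta^2}{1+\beta}\,\frac{p-k}{k(p-k+1)}$

$\le \frac{\beta^2}{2(1+\beta)}\,\frac{n}{k}$.

From the Fano bound, the error probability is greater than $\frac{1}{2}$ if $\frac{\beta^2}{1+\beta}\,\frac{n}{k} < \log(p-k) < \log|\tilde{\mathbb{S}}|$, which completes the proof.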
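The final step of this argument can be evaluated numerically. The following is a minimal sketch, not part of the original proof: it plugs the mutual-information bound $\beta^2 n / (2(1+\beta)k)$ into Fano's inequality with $p - k + 1$ candidate supports. The function names and the sample values of $(n, p, k, \beta)$ are illustrative assumptions.

```python
import math


def mutual_information_bound(n, k, beta):
    """Upper bound beta^2 * n / (2 * (1 + beta) * k) on I(U; X^n)."""
    return beta ** 2 * n / (2.0 * (1.0 + beta) * k)


def fano_error_lower_bound(n, p, k, beta):
    """Fano lower bound 1 - (I + log 2) / log(p - k + 1), clipped at zero."""
    info = mutual_information_bound(n, k, beta)
    bound = 1.0 - (info + math.log(2.0)) / math.log(p - k + 1)
    return max(0.0, bound)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    for n in (50, 200, 800):
        print(n, round(fano_error_lower_bound(n, p=1000, k=10, beta=1.0), 4))
```

With these illustrative values the bound exceeds 1/2 at the smallest sample size and becomes vacuous (zero) at the largest, which mirrors the role of the ratio $n / (k \log(p - k))$ in the theorem.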

6. Discussion. In this paper, we studied the problem of recovering the support of a sparse eigenvector in a spiked covariance model. Our analysis allowed for high-dimensional scaling, where the problem size $p$ and sparsity index $k$ increase as functions of the sample size $n$. We analyzed two computationally tractable methods for sparse eigenvector recovery, namely diagonal thresholding and a semidefinite programming (SDP) relaxation [9], and provided precise conditions on the scaling of the triplet $(n, p, k)$ under which they succeed (or fail) in correctly recovering the support. The probability of success using diagonal thresholding undergoes a phase transition in terms of the rescaled sample size $\theta_{\mathrm{dia}}(n, p, k) = n/(k^2 \log(p - k))$, whereas the more complex SDP relaxation, when it has a rank-one solution, succeeds once the rescaled sample size $\theta_{\mathrm{sdp}}(n, p, k) = n/(k \log(p - k))$ is sufficiently large. Thus, the SDP relaxation has greater statistical efficiency, by a factor of $k$ relative to the simple diagonal thresholding method, but also a substantially larger computational complexity. Finally, using information-theoretic methods, we showed that no method, regardless of its computational complexity, can recover the support set with vanishing error probability if $\theta_{\mathrm{sdp}}(n, p, k)$ is smaller than a critical constant. Our results thus provide some insight into the trade-offs between statistical and computational efficiency in high-dimensional eigenanalysis.
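To make the two order parameters concrete, the short sketch below evaluates both rescaled sample sizes for a given triplet $(n, p, k)$. The function names `theta_dia` and `theta_sdp` and the sample triplet are illustrative choices, not notation taken from any library.

```python
import math


def theta_dia(n, p, k):
    """Rescaled sample size for diagonal thresholding: n / (k^2 log(p - k))."""
    return n / (k ** 2 * math.log(p - k))


def theta_sdp(n, p, k):
    """Rescaled sample size for the SDP relaxation: n / (k log(p - k))."""
    return n / (k * math.log(p - k))


if __name__ == "__main__":
    n, p, k = 500, 1000, 10  # hypothetical problem sizes
    print("theta_dia:", theta_dia(n, p, k))
    print("theta_sdp:", theta_sdp(n, p, k))
    print("ratio    :", theta_sdp(n, p, k) / theta_dia(n, p, k))
```

The ratio of the two quantities is always $k$, which is the statistical-efficiency gap between the two methods described above.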

There are various open questions associated with this work. Although we have focused on a Gaussian sampling distribution, parts of our analysis provide sufficient conditions for general noise matrices. While qualitatively similar results should hold for sub-Gaussian distributions [5], it would be interesting to characterize how these conditions change as the tail behavior of the noise is varied away from sub-Gaussian. For instance, under bounded moment conditions, one would expect to obtain rates polynomial (as opposed to logarithmic) in the dimension $p$. It is also interesting to consider extensions of our support recovery analysis to recovery of higher rank "spiked" matrices, in the spirit of Paul and Johnstone's [32] work on $\ell_2$-approximation, as opposed to the rank-one eigenvector outer product considered here.

APPENDIX A: MATRIX NORMS 

In this appendix, we review some of the properties of matrix norms, with an emphasis on induced operator norms. Recall from (4) that for a matrix $A \in \mathbb{R}^{m \times n}$, the operator norm induced by the vector norms $\ell_p$ and $\ell_q$ (on $\mathbb{R}^m$ and $\mathbb{R}^n$, resp.) is defined by

(56) $|||A|||_{p,q} = \max_{\|x\|_q = 1} \|Ax\|_p$

for $1 \le p, q \le \infty$. As particular examples, we have the $\ell_1$-operator norm given by $|||A|||_{1,1} = \max_{1 \le j \le n} \sum_{i=1}^{m} |A_{ij}|$, the $\ell_\infty$-operator norm by $|||A|||_{\infty,\infty} = \max_{1 \le i \le m} \sum_{j=1}^{n} |A_{ij}|$ and the spectral or $\ell_2$-operator norm by $|||A|||_{2,2} = \max_i\{\sigma_i(A)\}$, where $\sigma_i(A)$ are the singular values of $A$.

As a consequence of the definition (56), for any vector $x \in \mathbb{R}^n$, we have

(57) $\|Ax\|_p \le |||A|||_{p,q}\,\|x\|_q$,

a property referred to as $|||\cdot|||_{p,q}$ being consistent with vector norms $\|\cdot\|_p$ and $\|\cdot\|_q$ (on $\mathbb{R}^m$ and $\mathbb{R}^n$, resp.). It also follows from the definition, using (57) twice, that operator norms are consistent with themselves, in the following sense: if $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times k}$, then

(58) $|||AB|||_{p,q} \le |||A|||_{p,r}\,|||B|||_{r,q}$

for all $1 \le p, q, r \le \infty$.

We can also apply any vector norm to matrices, treating them as vectors, by concatenating their columns together. For example, we will use the following mixed-norm inequality

(59) $\|AB\|_\infty \le |||A|||_{\infty,\infty}\,\|B\|_\infty$,

where $\|B\|_\infty := \max_{i,j} |B_{ij}|$ is the elementwise $\ell_\infty$-norm, and $A$ and $B$ are as defined above. For the proof, let $b_1, \ldots, b_k$ denote the columns of $B$. Then,

$\|AB\|_\infty = \|[Ab_1, \ldots, Ab_k]\|_\infty = \max_{1 \le i \le k} \|Ab_i\|_\infty \le |||A|||_{\infty,\infty} \max_{1 \le i \le k} \|b_i\|_\infty = |||A|||_{\infty,\infty}\,\|B\|_\infty$.

For more details, see the standard books [18, 34].
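The row-sum and column-sum formulas, and the mixed-norm inequality (59), are easy to check on a small example. The sketch below uses plain Python lists; the helper names and the two test matrices are illustrative assumptions.

```python
def l1_operator_norm(A):
    """|||A|||_{1,1}: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))


def linf_operator_norm(A):
    """|||A|||_{inf,inf}: maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)


def elementwise_max_norm(A):
    """||A||_inf: largest absolute entry, treating the matrix as a vector."""
    return max(abs(a) for row in A for a in row)


def matmul(A, B):
    """Plain triple-loop matrix product of two list-of-lists matrices."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[l] * B[l][j] for l in range(inner)) for j in range(cols)]
            for row in A]


# Small hypothetical matrices: A is 2 x 3 and B is 3 x 2.
A = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]
B = [[2.0, -1.0], [0.5, 4.0], [-3.0, 1.0]]

lhs = elementwise_max_norm(matmul(A, B))
rhs = linf_operator_norm(A) * elementwise_max_norm(B)
print(lhs, "<=", rhs)
```

For these matrices the left-hand side of (59) is 11 and the right-hand side is 16, so the inequality holds with room to spare.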

APPENDIX B: LARGE DEVIATIONS FOR CHI-SQUARED VARIATES 

The following large-deviations bounds for centralized $\chi^2$ are taken from Laurent and Massart [24]. Given a centralized $\chi^2$-variate $X$ with $d$ degrees of freedom, then for all $x \ge 0$,

(60a) $\mathbb{P}[X - d \ge 2\sqrt{dx} + 2x] \le \exp(-x)$ and

(60b) $\mathbb{P}[X - d \le -2\sqrt{dx}] \le \exp(-x)$.

We also use the following slightly different version of the bound (60a),

(61) $\mathbb{P}[X - d \ge dx] \le \exp(-\tfrac{3}{16}dx^2)$, $0 \le x < \tfrac{1}{2}$,

due to Johnstone [19]. More generally, the analogous tail bounds for noncentral $\chi^2$, taken from Birge [4], can be established via the Chernoff bound. Let $X$ be a noncentral $\chi^2$ variable with $d$ degrees of freedom and noncentrality parameter $\nu \ge 0$. Then, for all $x > 0$,

(62a) $\mathbb{P}[X \ge (d + \nu) + 2\sqrt{(d + 2\nu)x} + 2x] \le \exp(-x)$ and

(62b) $\mathbb{P}[X \le (d + \nu) - 2\sqrt{(d + 2\nu)x}] \le \exp(-x)$.

We derive here a slightly weakened but useful form of the bound (62a), valid when $\nu$ satisfies $\nu \le Cd$ for a positive constant $C$. Under this assumption, then for any $\delta \in (0, 1)$, we have

(63) $\mathbb{P}[X \ge (d + \nu) + 4d\sqrt{\delta}] \le \exp\left(-\frac{\delta}{1 + 2C}\,d\right)$.

To establish this bound, let $x = \frac{d^2\delta}{d + 2\nu}$ for some $\delta \in (0, 1)$. From (62a), we have

$p^* := \mathbb{P}\left[X \ge (d + \nu) + 2d\sqrt{\delta} + 2\frac{d^2\delta}{d + 2\nu}\right] \le \exp\left(-\frac{d^2\delta}{d + 2\nu}\right) \le \exp\left(-\frac{\delta}{1 + 2C}\,d\right)$.

Moreover, we have

$p^* \ge \mathbb{P}[X \ge (d + \nu) + 2d\sqrt{\delta} + 2d\delta] \ge \mathbb{P}[X \ge (d + \nu) + 4d\sqrt{\delta}]$,

since $\sqrt{\delta} \ge \delta$ for $\delta \in (0, 1)$.
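The central bounds (60a) and (61) can be compared against exact tail probabilities when the number of degrees of freedom is even, because the chi-squared tail then has a finite closed form. The sketch below is an illustration under that restriction; the function names and the grid of test values are assumptions made here.

```python
import math


def chi2_upper_tail_even(d, y):
    """Exact P[X > y] for a central chi-squared X with even d degrees of freedom.

    Uses the closed form exp(-y/2) * sum_{i < d/2} (y/2)^i / i!.
    """
    if d <= 0 or d % 2 != 0:
        raise ValueError("d must be a positive even integer")
    if y <= 0:
        return 1.0
    half = y / 2.0
    term, total = 1.0, 1.0
    for i in range(1, d // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def check_60a(d, x):
    """Return (exact tail, bound) for P[X - d >= 2 sqrt(d x) + 2 x] <= exp(-x)."""
    y = d + 2.0 * math.sqrt(d * x) + 2.0 * x
    return chi2_upper_tail_even(d, y), math.exp(-x)


def check_61(d, x):
    """Return (exact tail, bound) for P[X - d >= d x] <= exp(-3 d x^2 / 16)."""
    return chi2_upper_tail_even(d, d * (1.0 + x)), math.exp(-3.0 * d * x * x / 16.0)


if __name__ == "__main__":
    for d in (2, 10, 50):
        print(d, check_60a(d, 1.0), check_61(d, 0.25))
```

In each printed pair the first number (the exact tail) should not exceed the second (the bound).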

APPENDIX C: PROOF OF LEMMA 4 
Using the form of the $\chi_n^2$ PDF, we have, for even $n$ and any $t > 0$,

$\mathbb{P}\left[\frac{\chi_n^2}{n} > 1 + t\right] = \frac{1}{2^{n/2}\Gamma(n/2)} \int_{(1+t)n}^{\infty} x^{n/2-1}\exp(-x/2)\,dx$

$= \frac{1}{2^{n/2}\Gamma(n/2)} \cdot \frac{(n/2-1)!}{(1/2)^{(n/2-1)+1}} \exp\left(-\frac{n(1+t)}{2}\right) \sum_{i=0}^{n/2-1} \frac{1}{i!}\left(\frac{n(1+t)}{2}\right)^{i}$

$\ge \left[\frac{\exp(-n/2)\,(n/2)^{n/2-1}}{(n/2-1)!}\right] \exp(-nt/2)\,(1+t)^{n/2-1}$,

where the second line uses a standard integral formula (cf. Section 3.35 in the reference book [14]). Using Stirling's approximation for $(n/2 - 1)!$, the term within square brackets is lower bounded by $2C/\sqrt{n}$. Also, over $t \in (0, 1)$, we have $(1 + t)^{-1} \ge 1/2$, so we conclude that

(64) $\mathbb{P}\left[\frac{\chi_n^2}{n} > 1 + t\right] \ge \frac{C}{\sqrt{n}} \exp\left(-\frac{n}{2}[t - \log(1 + t)]\right)$.

Defining the function $f(t) = \log(1 + t)$, we calculate $f(0) = 0$, $f'(0) = 1$ and $f''(t) = -1/(1 + t)^2$. Note that $f''(t) \ge -1$ for all $t \ge 0$. Consequently, via a second-order Taylor series expansion, we have $f(t) - t \ge -t^2/2$. Substituting this bound into (64) yields

$\mathbb{P}\left[\frac{\chi_n^2}{n} > 1 + t\right] \ge \frac{C}{\sqrt{n}} \exp\left(-\frac{nt^2}{4}\right)$

as claimed.
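The lower bound in this appendix can be sanity-checked with the same finite-sum formula for the chi-squared tail that the proof starts from. The sketch below computes the ratio of the exact tail to $n^{-1/2}\exp(-nt^2/4)$ for a few even $n$ and $t \in (0, 1)$; the lemma asserts that this ratio stays above some constant $C$. The function names and the grid of values are illustrative assumptions, and the check covers only the listed cases.

```python
import math


def chi2_upper_tail_even(n, y):
    """Exact P[X > y] for a central chi-squared X with even n degrees of freedom."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    half = y / 2.0
    term, total = 1.0, 1.0
    for i in range(1, n // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def lemma4_ratio(n, t):
    """sqrt(n) * P[chi2_n / n > 1 + t] * exp(n t^2 / 4)."""
    tail = chi2_upper_tail_even(n, n * (1.0 + t))
    return math.sqrt(n) * tail * math.exp(n * t * t / 4.0)


ratios = [lemma4_ratio(n, t) for n in (2, 10, 50, 200) for t in (0.1, 0.5, 0.9)]

if __name__ == "__main__":
    print(min(ratios))
```

On this grid the smallest ratio stays bounded away from zero, consistent with the existence of a positive constant $C$.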



APPENDIX D: PROOFS FOR SECTION 4.2 

D.1. Proof of Lemma 6. The argument we present here has a deterministic nature. In other words, we will show that if the conditions of the lemma hold for a nonrandom sequence of matrices $\Delta_{SS}$, the conclusions will follow. Thus, for example, all the references to limits may be regarded as deterministic. Then, since the conditions of the lemma are assumed to hold for a random $\Delta_{SS}$ a.a.s., it immediately follows that the conclusions hold a.a.s. To simplify the argument let us assume that $\alpha^{-1}|||\Delta_{SS}|||_{\infty,\infty} \le \varepsilon$ for sufficiently small $\varepsilon > 0$; it turns out that $\varepsilon = \frac{1}{10}$ is enough.

We prove the lemma in steps. First, by Weyl's theorem [18, 34], eigenvalues of the perturbed matrix $\alpha z_S^* z_S^{*T} + \Delta_{SS}$ are contained in intervals of length $2|||\Delta_{SS}|||_{2,2}$ centered at eigenvalues of $\alpha z_S^* z_S^{*T}$. Since the matrix $z_S^* z_S^{*T}$ is rank one, one eigenvalue of the perturbed matrix is in the interval $[\alpha \pm |||\Delta_{SS}|||_{2,2}]$, and the remaining $k - 1$ eigenvalues are in the interval $[0 \pm |||\Delta_{SS}|||_{2,2}]$. Since by assumption $2|||\Delta_{SS}|||_{2,2} < \alpha$ eventually, the two intervals are disjoint, and the first one contains the maximal eigenvalue $\gamma_1$ while the second contains the second largest eigenvalue $\gamma_2$. In other words, $|\gamma_1 - \alpha| \le |||\Delta_{SS}|||_{2,2}$ and $|\gamma_2| \le |||\Delta_{SS}|||_{2,2}$. Since $|||\Delta_{SS}|||_{2,2} \to 0$ by assumption, we conclude that $\gamma_1 \to \alpha$ and $\gamma_2 \to 0$. For the rest of the proof, take $n$ large enough so

(65) $|\gamma_1\alpha^{-1} - 1| \le \varepsilon$,

where $\varepsilon > 0$ is a small number to be determined.

Now, let $\hat{z}_S \in \mathbb{R}^k$ with $\|\hat{z}_S\|_2 = 1$ be the eigenvector associated with $\gamma_1$, that is,

(66) $(\alpha z_S^* z_S^{*T} + \Delta_{SS})\hat{z}_S = \gamma_1\hat{z}_S$.

Taking inner products with $\hat{z}_S$, one obtains $\alpha(z_S^{*T}\hat{z}_S)^2 + \hat{z}_S^T\Delta_{SS}\hat{z}_S = \gamma_1$. Noting that $|\hat{z}_S^T\Delta_{SS}\hat{z}_S|$ is upper-bounded by $|||\Delta_{SS}|||_{2,2}$, we have by triangle inequality

$|\alpha - \alpha(z_S^{*T}\hat{z}_S)^2| = |\alpha - \gamma_1 + \gamma_1 - \alpha(z_S^{*T}\hat{z}_S)^2|$

$\le |\alpha - \gamma_1| + |\gamma_1 - \alpha(z_S^{*T}\hat{z}_S)^2| \le 2|||\Delta_{SS}|||_{2,2}$,

which implies $z_S^{*T}\hat{z}_S \to 1$ (taking into account our sign convention). Take $n$ large enough so that

(67) $|z_S^{*T}\hat{z}_S - 1| \le \varepsilon$

and let $u$ be the solution of

(68) $\alpha z_S^* + \Delta_{SS}u = \alpha u$,

which is an approximation of (66) satisfied by $\hat{z}_S$. Using triangle inequality, one has $\|u\|_\infty \le \|z_S^*\|_\infty + \alpha^{-1}|||\Delta_{SS}|||_{\infty,\infty}\|u\|_\infty$, which implies that

(69) $\|u\|_\infty \le (1 - \alpha^{-1}|||\Delta_{SS}|||_{\infty,\infty})^{-1}\|z_S^*\|_\infty \le (1 - \varepsilon)^{-1}\|z_S^*\|_\infty$.

We also have

(70) $\|u - z_S^*\|_\infty \le \alpha^{-1}|||\Delta_{SS}|||_{\infty,\infty}\|u\|_\infty \le \varepsilon(1 - \varepsilon)^{-1}\|z_S^*\|_\infty$.

Subtracting (68) from (66), we obtain $\alpha z_S^*(z_S^{*T}\hat{z}_S - 1) + \Delta_{SS}(\hat{z}_S - u) = \gamma_1\hat{z}_S - \alpha u$. Adding and subtracting $\gamma_1 u$ on the right-hand side and dividing by $\alpha$, we have

$z_S^*(z_S^{*T}\hat{z}_S - 1) + \alpha^{-1}\Delta_{SS}(\hat{z}_S - u) = \gamma_1\alpha^{-1}(\hat{z}_S - u) + (\gamma_1\alpha^{-1} - 1)u$,

which implies

$\|\hat{z}_S - u\|_\infty \le (|\gamma_1\alpha^{-1}| - \alpha^{-1}|||\Delta_{SS}|||_{\infty,\infty})^{-1} \times \{|z_S^{*T}\hat{z}_S - 1| \cdot \|z_S^*\|_\infty + |\gamma_1\alpha^{-1} - 1| \cdot \|u\|_\infty\}$

$\le (1 - 2\varepsilon)^{-1}[\varepsilon + \varepsilon(1 - \varepsilon)^{-1}] \cdot \|z_S^*\|_\infty$,

where the last inequality follows from (65), (67) and (69). Combining with the bound (70) on $\|u - z_S^*\|_\infty$ yields

$\frac{\|\hat{z}_S - z_S^*\|_\infty}{\|z_S^*\|_\infty} \le \frac{\varepsilon}{1 - 2\varepsilon} + \frac{\varepsilon}{(1 - 2\varepsilon)(1 - \varepsilon)} + \frac{\varepsilon}{1 - \varepsilon} \le \frac{3\varepsilon}{(1 - 2\varepsilon)^2}$.

Finally, we take $\varepsilon = \frac{1}{10}$ to conclude $\|\hat{z}_S - z_S^*\|_\infty \le \frac{1}{2}\|z_S^*\|_\infty = \frac{1}{2\sqrt{k}}$ a.a.s., as
claimed.
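The last step is plain arithmetic, and it can be checked exactly with rational numbers. The sketch below evaluates the three-term bound and its simplification $3\varepsilon/(1-2\varepsilon)^2$. It assumes the choice $\varepsilon = 1/10$ and a target of $1/2$, as read from the end of the proof; the function names are illustrative.

```python
from fractions import Fraction


def three_term_bound(eps):
    """eps/(1-2eps) + eps/((1-2eps)(1-eps)) + eps/(1-eps)."""
    return (eps / (1 - 2 * eps)
            + eps / ((1 - 2 * eps) * (1 - eps))
            + eps / (1 - eps))


def simplified_bound(eps):
    """The cruder bound 3 eps / (1 - 2 eps)^2 used in the final display."""
    return 3 * eps / (1 - 2 * eps) ** 2


if __name__ == "__main__":
    for m in (8, 9, 10, 20):
        eps = Fraction(1, m)
        print(m, three_term_bound(eps), simplified_bound(eps),
              simplified_bound(eps) <= Fraction(1, 2))
```

Among values of the form $\varepsilon = 1/m$ with integer $m$, the simplified bound first drops below $1/2$ at $m = 10$, where it equals $15/32$.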

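D.2. Proof of Lemma 7. Recall that by definition, $\tilde{z}_S = \hat{z}_S/\|\hat{z}_S\|_1$. Using the identity $\mathrm{sign}(\hat{z}_S)^T\hat{z}_S = \|\hat{z}_S\|_1$ yields $U_{S^cS}\hat{z}_S = \rho_n^{-1}\Delta_{S^cS}\hat{z}_S$, which is the desired equation. It only remains to prove that $U_{S^cS}$ is indeed a valid sign matrix.

First note that from (28) we have $|\hat{z}_i| \in [\frac{1}{2\sqrt{k}}, \frac{3}{2\sqrt{k}}]$, which implies that $\|\hat{z}_S\|_1 \in [\frac{\sqrt{k}}{2}, \frac{3\sqrt{k}}{2}]$. Thus, $\|\tilde{z}_S\|_2 = 1/\|\hat{z}_S\|_1 \le \frac{2}{\sqrt{k}}$. Now we can write

$\max_{i,j} |U_{ij}| \le \rho_n^{-1}\|\Delta_{S^cS}\tilde{z}_S\|_\infty$

$\le \rho_n^{-1}|||\Delta_{S^cS}|||_{\infty,2}\,\|\tilde{z}_S\|_2$

$\le \frac{2k}{\beta} \cdot \frac{\delta}{\sqrt{k}} \cdot \frac{2}{\sqrt{k}} = \frac{4\delta}{\beta}$,

so that taking $\delta \le \frac{\beta}{4}$ completes the proof.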



D.3. Proof of Lemma 8. Here we provide the proof for the case $\Gamma_{p-k} = I_{p-k}$; necessary modifications for the general case are discussed in Section 4.4. First, let us bound the cross-term in (32). Recall that $\tilde{z}_S = \hat{z}_S/\|\hat{z}_S\|_1$. Also, by our choice (30) of $U_{S^cS}$, we have

$\Phi_{S^cS} = \Delta_{S^cS} - \rho_nU_{S^cS} = \Delta_{S^cS} - \Delta_{S^cS}\tilde{z}_S\,\mathrm{sign}(\hat{z}_S)^T$.

Now, using the sub-multiplicative property of operator norms [see relation (58) in Appendix A], we can write

$|||\Phi_{S^cS}|||_{\infty,2} = |||\Delta_{S^cS}(I_k - \tilde{z}_S\,\mathrm{sign}(\hat{z}_S)^T)|||_{\infty,2}$

$\le |||\Delta_{S^cS}|||_{\infty,2} \cdot |||I_k - \tilde{z}_S\,\mathrm{sign}(\hat{z}_S)^T|||_{2,2}$

(71) $\le |||\Delta_{S^cS}|||_{\infty,2} \cdot (1 + \|\tilde{z}_S\|_2\,\|\mathrm{sign}(\hat{z}_S)\|_2)$

$\le 3|||\Delta_{S^cS}|||_{\infty,2}$,

where we have also used the fact that $|||ab^T|||_{2,2} = \|a\|_2\|b\|_2$, and $\|\tilde{z}_S\|_2 = 1/\|\hat{z}_S\|_1 \le \frac{2}{\sqrt{k}}$ using the bound (28). Recall the decomposition $x = (u, v)$, where $u = \mu\hat{z}_S + \hat{z}_S^{\perp}$ with $\mu^2 + \|\hat{z}_S^{\perp}\|_2^2 \le 1$. Also, by our choice (30) of $U_{S^cS}$, we have $\Phi_{S^cS}u = \Phi_{S^cS}\hat{z}_S^{\perp}$. Thus,

$\max_{u,v} |2v^T\Phi_{S^cS}u| \le \max_{\hat{z}_S^{\perp},\,v} |2v^T\Phi_{S^cS}\hat{z}_S^{\perp}|$

(72) $\le \sqrt{1 - \mu^2}\,\max_{\|u\|_2 \le 1,\,v} |2v^T\Phi_{S^cS}u|$.

Using Holder's inequality, we have

$\max_{\|u\|_2 \le 1} |2v^T\Phi_{S^cS}u| \le 2\|v\|_1 \max_{\|u\|_2 \le 1} \|\Phi_{S^cS}u\|_\infty$

(73) $\le 2\|v\|_1\,|||\Phi_{S^cS}|||_{\infty,2} \le \frac{6\delta}{\sqrt{k}}\|v\|_1$,

where we have used bound (71) and applied condition (31). We now turn to the last term in the decomposition (32), namely $v^T\Phi_{S^cS^c}v = v^T\Delta_{S^cS^c}v - \rho_nv^TU_{S^cS^c}v$. In order to minimize this term, we use our freedom to choose $U_{S^cS^c}(x) = \mathrm{sign}(v)\,\mathrm{sign}(v)^T$, so that $-\rho_nv^TU_{S^cS^c}v$ simply becomes $-\rho_n\|v\|_1^2$. Define the objective function $f^* := \max_x x^T\Phi x$. Also let $H = n^{-1/2}G_{S^c}$, where $G_{S^c} = (G_{ij})$ for $1 \le i \le n$ and $j \in S^c$. Noting that $\Delta_{S^cS^c} = H^TH - I_m$ (with $m = p - k$) and using the bounds (33), (72) and (73), we obtain the following bound on the objective:

$f^* \le \max_u u^T\Phi_{SS}u + \max_{u,v} 2v^T\Phi_{S^cS}u + \max_v v^T\Phi_{S^cS^c}v$

(74) $\le [\mu^2\gamma_1 + (1 - \mu^2)\gamma_2] + (1 - \mu^2)\,\underbrace{\max_{\|v\|_2 \le 1}\left\{\frac{6\delta}{\sqrt{k}}\|v\|_1 + \|Hv\|_2^2 - \|v\|_2^2 - \rho_n\|v\|_1^2\right\}}_{g^*}$.

In obtaining the last inequality, we have used the change of variable $v \to \sqrt{1 - \mu^2}\,v$, with some abuse of notation, and exploited the inequality $\|v\|_2 \le \sqrt{1 - \mu^2}$. (Note that this bound follows from the identity $\|x\|_2^2 = 1 = \mu^2 + \|\hat{z}_S^{\perp}\|_2^2 + \|v\|_2^2$.)

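Let $v^*$ be the optimal solution to problem $g^*$ in (74); note that it is random due to the presence of $H$. Also, set $\mathbb{S} := \{(\eta_i, \ell_{ij})\}$, where $i$ and $j$ range over $\{1, 2, \ldots, \lceil\sqrt{m}\rceil\}$ and

$\eta_i = \frac{i}{\sqrt{m}}, \qquad \ell_{ij} = \frac{ij}{\sqrt{m}}$.

Note that $\mathbb{S}$ satisfies the condition of the lemma, namely $|\mathbb{S}| = \lceil\sqrt{m}\rceil^2 = O(m)$.

Since $\|v^*\|_2 \le 1$ and $\|v^*\|_2 \le \|v^*\|_1 \le \sqrt{m}\,\|v^*\|_2$, there exists$^3$ $(\eta^*, \ell^*) \in \mathbb{S}$ such that

$\eta^* - \frac{1}{\sqrt{m}} < \|v^*\|_2 \le \eta^*$,

$\ell^* - 3 < \|v^*\|_1 \le \ell^*$.

Thus, using condition (34), we have

$\|Hv^*\|_2 \le \max_{\|v\|_2 \le \eta^*,\ \|v\|_1 \le \ell^*} \|Hv\|_2 \le \eta^* + \frac{\delta}{\sqrt{k}}\ell^* + \varepsilon$

$\le \|v^*\|_2 + \frac{1}{\sqrt{m}} + \frac{\delta}{\sqrt{k}}(\|v^*\|_1 + 3) + \varepsilon$.

To simplify notation, let

(75) $A = A(\varepsilon, \delta, m, k) := 1/\sqrt{m} + 3\delta/\sqrt{k} + \varepsilon$,

so that the bound in the above display may be written as $\|v^*\|_2 + \delta\|v^*\|_1/\sqrt{k} + A$. Now, we have

$\|Hv^*\|_2^2 - \|v^*\|_2^2 \le 2\|v^*\|_2\left(\delta\frac{\|v^*\|_1}{\sqrt{k}} + A\right) + \left(\delta\frac{\|v^*\|_1}{\sqrt{k}} + A\right)^2$

(76) $\le 2\left(\delta\frac{\|v^*\|_1}{\sqrt{k}} + A\right) + \left(\delta\frac{\|v^*\|_1}{\sqrt{k}} + A\right)^2$.

3 Let $i^* = \lceil\sqrt{m}\,\|v^*\|_2\rceil$ and $\eta^* = i^*/\sqrt{m}$. Using the fact that, for any $x$, $\lceil x\rceil - 1 < x \le \lceil x\rceil$, we have $\eta^* - 1/\sqrt{m} < \|v^*\|_2 \le \eta^*$ or, equivalently, $\|v^*\|_2 = \eta^* + \xi$ where $-1/\sqrt{m} < \xi \le 0$. Now let $j^* = \lceil\|v^*\|_1/\|v^*\|_2\rceil$. One has $(j^* - 1)\|v^*\|_2 < \|v^*\|_1 \le j^*\|v^*\|_2$ which, using the fact that $\|v^*\|_2 \le 1$, implies $j^*\|v^*\|_2 - 1 < \|v^*\|_1 \le j^*\|v^*\|_2$. This in turn implies $j^*\eta^* + j^*\xi - 1 < \|v^*\|_1 \le j^*\eta^*$. Take $\ell^* = j^*\eta^*$ and note that $j^*\xi - 1 > -3$, since $j^*$ is at most $\lceil\sqrt{m}\rceil$.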



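Using this in (74) and recalling from (26) that $\rho_n = \beta/(2k)$, we obtain the following bound:

$g^* \le 6\delta\frac{\|v^*\|_1}{\sqrt{k}} + 2\left(\delta\frac{\|v^*\|_1}{\sqrt{k}} + A\right) + \left(\delta\frac{\|v^*\|_1}{\sqrt{k}} + A\right)^2 - \frac{\beta}{2}\left(\frac{\|v^*\|_1}{\sqrt{k}}\right)^2$.

Note that this is quadratic in $\|v^*\|_1/\sqrt{k}$, that is,

$g^* \le a\left(\frac{\|v^*\|_1}{\sqrt{k}}\right)^2 + b\left(\frac{\|v^*\|_1}{\sqrt{k}}\right) + c$,

where

$a = \delta^2 - \frac{\beta}{2}, \quad b = 8\delta + 2\delta A \quad \text{and} \quad c = 2A + A^2$.

By choosing $\delta$ sufficiently small, say $\delta^2 < \beta/4$, we can make $a$ negative. This makes the quadratic form $ax^2 + bx + c$ achieve a maximum of $c + b^2/(4(-a))$, at the point $x^* = b/(2(-a))$. Note that we have $b/(2(-a)) \to 0$ and $c \to 0$ as $\varepsilon, \delta \to 0$ and $m, k \to \infty$. Consequently, we can make this maximum (and hence $g^*$) arbitrarily small eventually, say less than $\alpha/2$, by choosing $\delta$ and $\varepsilon$ sufficiently small.

Combining this bound on $g^*$ with our bound (74) on $f^*$, and recalling that $\gamma_1 \to \alpha$ and $\gamma_2 \to 0$ by Lemma 6, we conclude that

$f^* \le \mu^2(\alpha + o(1)) + (1 - \mu^2)\left(o(1) + \frac{\alpha}{2}\right) \le \alpha + o(1)$

as claimed.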
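The quadratic maximization in the last part of this proof can be reproduced with exact rational arithmetic. The sketch below assumes the coefficients $a = \delta^2 - \beta/2$, $b = 8\delta + 2\delta A$ and $c = 2A + A^2$, which is our reading of the expansion of the bound on $g^*$; the function names and the numerical values of $\delta$, $\beta$ and $A$ are hypothetical.

```python
from fractions import Fraction


def quadratic_coefficients(delta, beta, A):
    """Coefficients (a, b, c) of the quadratic upper bound on g*.

    These formulas are an assumption based on expanding the bound term by term.
    """
    a = delta ** 2 - beta / 2
    b = 8 * delta + 2 * delta * A
    c = 2 * A + A ** 2
    return a, b, c


def quadratic_maximum(a, b, c):
    """Maximizer and maximum of a x^2 + b x + c when a is negative."""
    if a >= 0:
        raise ValueError("a must be negative for a finite maximum")
    x_star = b / (2 * (-a))
    return x_star, c + b ** 2 / (4 * (-a))


# Hypothetical values: delta^2 = 1/100 is below beta/4 = 1/4, so a is negative.
delta, beta, A = Fraction(1, 10), Fraction(1), Fraction(1, 20)
a, b, c = quadratic_coefficients(delta, beta, A)
x_star, g_max = quadratic_maximum(a, b, c)

if __name__ == "__main__":
    print("a, b, c =", a, b, c)
    print("x* =", x_star, " maximum =", g_max)
```

Shrinking $\delta$ and $A$ drives both $b$ and $c$ to zero while $a$ stays near $-\beta/2$, which is why the maximum can be made smaller than any fixed positive level.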



APPENDIX E: PROOF OF LEMMA 9 

In this appendix, we prove Lemma 9, a general result on the $|||\cdot|||_{\infty,2}$-norm of Wishart matrices. Some of the intermediate results are of independent interest and are stated as separate lemmas. Two sets of large deviation inequalities will be used, one for chi-squared RVs $\chi_n^2$ and one for "sums of Gaussian product" random variates. To define the latter precisely, let $Z_1$ and $Z_2$ be independent Gaussian RVs, and consider the sum $\sum_{i=1}^{n} X_i$ where $X_i \overset{\text{i.i.d.}}{\sim} Z_1Z_2$, for $1 \le i \le n$. The following tail bounds are known [4, 21]:

(77) $\mathbb{P}\left(\left|n^{-1}\sum_{i=1}^{n} X_i\right| > t\right) \le C\exp(-3nt^2/2)$ as $t \to 0$;

(78) $\mathbb{P}(|n^{-1}\chi_n^2 - 1| > t) \le 2\exp(-3nt^2/16)$, $0 \le t < 1/2$,

where $C$ is some positive constant.

Let $W$ be a $p \times p$ centered Wishart matrix as defined in (37). Consider the following linear combination of off-diagonal entries of the first row:

$\sum_{j=2}^{p} a_jW_{1j} = n^{-1}\sum_{i=1}^{n} g_{i1} \sum_{j=2}^{p} g_{ij}a_j$.

Let $\xi_i := \|a\|_2^{-1}\sum_{j=2}^{p} g_{ij}a_j$, where $a = (a_2, \ldots, a_p) \in \mathbb{R}^{p-1}$. Note that $\{\xi_i\}_{i=1}^{n}$ is a collection of independent standard Gaussian RVs. Moreover, $\{\xi_i\}_{i=1}^{n}$ is independent of $\{g_{i1}\}_{i=1}^{n}$. Now we have

$\sum_{j=2}^{p} a_jW_{1j} = n^{-1}\|a\|_2 \sum_{i=1}^{n} g_{i1}\xi_i$,

which is a (scaled) sum of Gaussian products (as defined above). Using (77), we obtain

(79) $\mathbb{P}\left(\left|\sum_{j=2}^{p} a_jW_{1j}\right| > t\right) \le C\exp(-3nt^2/(2\|a\|_2^2))$.

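Combining the bounds in (79) and (78), we can bound a full linear combination of first-row entries. More specifically, let $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$, with $x_1 \ne 0$ and $\sum_{j=2}^{p} x_j^2 \ne 0$, and consider the linear combination $\sum_{j=1}^{p} x_jW_{1j}$. Noting that $W_{11} = n^{-1}\sum_i (g_{i1})^2 - 1$ is a centered $\chi_n^2$, we obtain

$\mathbb{P}\left(\left|\sum_{j=1}^{p} x_jW_{1j}\right| > t\right) \le \mathbb{P}\left(|x_1W_{11}| + \left|\sum_{j=2}^{p} x_jW_{1j}\right| > t\right)$

$\le \mathbb{P}[|x_1W_{11}| > t/2] + \mathbb{P}\left[\left|\sum_{j=2}^{p} x_jW_{1j}\right| > t/2\right]$

$\le 2\exp\left(-\frac{3nt^2}{16 \cdot 4x_1^2}\right) + C\exp\left(-\frac{3nt^2}{2 \cdot 4\sum_{j=2}^{p} x_j^2}\right)$

$\le 2\max\{2, C\}\exp\left(-\frac{3nt^2}{16 \cdot 4\sum_{j=1}^{p} x_j^2}\right)$.

Note that the last inequality holds, in general, for $x \ne 0$. Since there is nothing special about the "first" row, we can conclude the following.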

Lemma 15. For $t > 0$ small enough, there are (numerical constants) $c > 0$ and $C > 0$ such that for all $x \in \mathbb{R}^p \setminus \{0\}$,

(80) $\mathbb{P}\left(\left|\sum_{j=1}^{p} x_jW_{ij}\right| > t\right) \le C\exp(-cnt^2/\|x\|_2^2)$

for $1 \le i \le p$.

Now, let $I, J \subset \{1, \ldots, p\}$ be index sets,$^4$ both allowed to depend on $p$ (though we have omitted the dependence for brevity). Choose $x$ such that $x_j = 0$ for $j \notin J$ and $\|x_J\|_2 = 1$. Note that $\|W_{I,J}x_J\|_\infty = \max_{i \in I} |\sum_{j \in J} W_{ij}x_j| = \max_{i \in I} |\sum_{j=1}^{p} W_{ij}x_j|$, suggesting the following lemma.

Lemma 16. Consider some index set $I$ such that $|I| \to \infty$ and $n^{-1}\log|I| \to 0$ as $n, p \to \infty$, and some $x_J \in S^{|J|-1}$. Then, there exists an absolute constant $B > 0$ such that

(81) $\|W_{I,J}x_J\|_\infty \le B\sqrt{\frac{\log|I|}{n}}$

as $n, p \to \infty$, with probability 1.

Proof. Applying the union bound in conjunction with the bound (80) yields

(82) $\mathbb{P}\left(\max_{i \in I}\left|\sum_{j=1}^{p} W_{ij}x_j\right| > t\right) \le |I|\,C\exp(-cnt^2)$.

Letting $t = B\sqrt{n^{-1}\log|I|}$, the right-hand side simplifies to $C\exp(-(cB^2 - 1)\log|I|)$. Taking $B > \sqrt{2c^{-1}}$ and applying the Borel-Cantelli lemma completes the proof. □

Note that as a corollary, setting $x_J = (1, 0, \ldots, 0)$ yields bounds on the $\infty$-norm of columns (or, equivalently, rows) of Wishart matrices.
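The corollary can be illustrated with a small simulation. The sketch below assumes the centered Wishart matrix has the form $W = n^{-1}G^TG - I$ for an $n \times p$ matrix $G$ of independent standard Gaussians (an assumption about definition (37), which is not reproduced here). It draws one column of $W$ and compares its largest absolute entry with the rate $\sqrt{\log p / n}$. The sizes, the seed and the function name are arbitrary illustrative choices.

```python
import math
import random


def centered_wishart_column(n, p, col=0, seed=0):
    """Column `col` of W = G^T G / n - I, with G an n x p standard Gaussian matrix."""
    rng = random.Random(seed)
    G = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    column = []
    for i in range(p):
        entry = sum(G[r][i] * G[r][col] for r in range(n)) / n
        column.append(entry - (1.0 if i == col else 0.0))
    return column


n, p = 400, 50
w = centered_wishart_column(n, p)
sup_norm = max(abs(v) for v in w)
rate = math.sqrt(math.log(p) / n)
ratio = sup_norm / rate

if __name__ == "__main__":
    print("sup-norm of column:", sup_norm)
    print("sqrt(log p / n)   :", rate)
    print("ratio             :", ratio)
```

A single draw cannot confirm an almost-sure statement, but the ratio staying of constant order as $n$ and $p$ vary is what (81) predicts.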

Lemma 16 may be used to obtain the desired bound on $|||W_{I,J}|||_{\infty,2}$. For simplicity, let $y \in \mathbb{R}^{|J|}$ represent a generic $|J|$-vector. Recall that $|||W_{I,J}|||_{\infty,2} = \max_{y \in S^{|J|-1}} \|W_{I,J}y\|_\infty$. We use a standard discretization argument, covering the unit $\ell_2$-ball of $\mathbb{R}^{|J|}$ using an $\varepsilon$-net, say $\mathcal{N}$. It can be shown [27] that there exists such a net with cardinality $|\mathcal{N}| < (3/\varepsilon)^{|J|}$. For every $y \in S^{|J|-1}$, let $u_y \in \mathcal{N}$ be the point such that $\|y - u_y\|_2 \le \varepsilon$. Then

$\|W_{I,J}y\|_\infty \le |||W_{I,J}|||_{\infty,2}\,\|y - u_y\|_2 + \|W_{I,J}u_y\|_\infty$

$\le |||W_{I,J}|||_{\infty,2}\,\varepsilon + \|W_{I,J}u_y\|_\infty$.

Taking the maximum over $y \in S^{|J|-1}$ and rearranging yields the inequality

(83) $|||W_{I,J}|||_{\infty,2} \le (1 - \varepsilon)^{-1}\max_{u \in \mathcal{N}} \|W_{I,J}u\|_\infty$.

Using this bound (83), we can now provide the proof of Lemma 9 as follows. Let $\mathcal{N} = \{u_1, \ldots, u_{|\mathcal{N}|}\}$ be a $\frac{1}{2}$-net of the ball $S^{|J|-1}$, with cardinality $|\mathcal{N}| < 6^{|J|}$. Then, from our bound (83), we have

$\mathbb{P}(|||W_{I,J}|||_{\infty,2} > t) \le \mathbb{P}\left(2\max_{u_i \in \mathcal{N}} \|W_{I,J}u_i\|_\infty > t\right)$

$\le |\mathcal{N}| \cdot \mathbb{P}(\|W_{I,J}u_1\|_\infty > t/2)$

$\le 6^{|J|} \cdot C|I|\exp(-cnt^2/4)$.

In the last line, we used (82). Taking $t = D\sqrt{\frac{|J| + \log|I|}{n}}$ with $D$ large enough and using the Borel-Cantelli lemma completes the proof.

4 We always assume that these index sets form an increasing sequence of sets. More precisely, with $I = I_p$, we assume $I_1 \subset I_2 \subset \cdots$. We also assume $|I_p| \to \infty$ as $p \to \infty$.



APPENDIX F: PROOF OF LEMMA 14 
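The mixture covariance can be expressed as

$\Sigma_M := E[xx^T] = E[E[xx^T \mid U]]$

$= \sum_{S \in \tilde{\mathbb{S}}} \frac{1}{|\tilde{\mathbb{S}}|} E[xx^T \mid U = S]$

$= \sum_{S \in \tilde{\mathbb{S}}} \frac{1}{|\tilde{\mathbb{S}}|} (I_p + \beta z^*(S)z^*(S)^T) = I_p + \frac{\beta}{k|\tilde{\mathbb{S}}|}Y$,

where

$Y_{ij} = \sum_{S \in \tilde{\mathbb{S}}} [\sqrt{k}z^*(S)]_i[\sqrt{k}z^*(S)]_j = \sum_{S \in \tilde{\mathbb{S}}} \mathbb{1}\{i \in S\}\mathbb{1}\{j \in S\}$.

Let $R := \{1, \ldots, k - 1\}$ and $R^c := \{k, \ldots, p\}$. Note that we always have $R \subset S$ for $S \in \tilde{\mathbb{S}}$. In general, we have

$Y_{ij} = \begin{cases} |\tilde{\mathbb{S}}|, & \text{if both } i, j \in R, \\ 1, & \text{if exactly one of } i \text{ or } j \in R, \\ \delta_{ij}, & \text{if both } i, j \notin R. \end{cases}$

Consequently, $Y$ takes the form

$Y = \begin{pmatrix} |\tilde{\mathbb{S}}|\,1_R1_R^T & 1_R1_{R^c}^T \\ 1_{R^c}1_R^T & I_{R^c} \end{pmatrix}$,

where $1_R$, for example, denotes the vector of all ones over the index set $R$.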
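We conjecture an eigenvector of the form

$v = \begin{pmatrix} 1_R \\ b\,1_{R^c} \end{pmatrix}$

and let us denote the associated eigenvalue as $\lambda$. Thus, we assume $Yv = \lambda v$, or, in more detail,

$|\tilde{\mathbb{S}}||R|\,1_R + b|R^c|\,1_R = \lambda 1_R$,

$|R|\,1_{R^c} + b\,1_{R^c} = \lambda b\,1_{R^c}$,

where we have used, for example, $1_R^T1_R = |R|$. Note that $|R^c| = |\tilde{\mathbb{S}}| = p - k + 1$. Rewriting in terms of $|\tilde{\mathbb{S}}|$, we get

$|\tilde{\mathbb{S}}|(|R| + b) = \lambda$,

$|R| + b = \lambda b$,

from which we conclude, assuming $\lambda \ne 0$, that $b = \frac{1}{|\tilde{\mathbb{S}}|}$. This, in turn, implies $\lambda = |\tilde{\mathbb{S}}||R| + 1$.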

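Thus far, we have determined an eigenpair. We can now subtract $\lambda(v/\|v\|_2)(v/\|v\|_2)^T = (\lambda/\|v\|_2^2)vv^T$ and search for the rest of the eigenvalues in the remainder. Note that

$\frac{\lambda}{\|v\|_2^2} = \frac{|\tilde{\mathbb{S}}||R| + 1}{|R| + b^2|R^c|} = \frac{|\tilde{\mathbb{S}}||R| + 1}{|R| + 1/|\tilde{\mathbb{S}}|} = |\tilde{\mathbb{S}}|$.

Thus, we have

$\frac{\lambda}{\|v\|_2^2}vv^T = \begin{pmatrix} |\tilde{\mathbb{S}}|\,1_R1_R^T & 1_R1_{R^c}^T \\ 1_{R^c}1_R^T & \frac{1}{|\tilde{\mathbb{S}}|}1_{R^c}1_{R^c}^T \end{pmatrix}$,

implying

$Y - \frac{\lambda}{\|v\|_2^2}vv^T = \begin{pmatrix} 0 & 0 \\ 0 & I_{R^c} - \frac{1}{|\tilde{\mathbb{S}}|}1_{R^c}1_{R^c}^T \end{pmatrix}$.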






A. A. AMINI AND M. J. WAINWRIGHT 




The nonzero block of the remainder has one eigenvalue equal to $1 - \frac{|R^c|}{|\tilde{\mathbb{S}}|} = 0$ and the rest of $|R^c| - 1$ of its eigenvalues equal to 1. Thus, the remainder has $|R| + 1$ of its eigenvalues equal to zero and $|R^c| - 1$ of them equal to one. Overall, we conclude that eigenvalues of $Y$ are as follows:

$|\tilde{\mathbb{S}}||R| + 1$, 1 time,
$1$, $|R^c| - 1$ times,
$0$, $|R|$ times

or

$(p - k + 1)(k - 1) + 1$, 1 time,
$1$, $p - k$ times,
$0$, $k - 1$ times.

The eigenvalues of $Y$ are mapped to those of $\Sigma_M$ by the affine map $x \to 1 + \frac{\beta}{k|\tilde{\mathbb{S}}|}x$, so that $\Sigma_M$ has eigenvalues

(84) $1 + \frac{\beta(k-1)}{k} + \frac{\beta}{k(p-k+1)}, \qquad 1 + \frac{\beta}{k(p-k+1)}, \qquad 1$

with multiplicities 1, $p - k$ and $k - 1$, respectively. The log determinant stated in the lemma then follows by straightforward calculation.
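The eigenvalue computation can be cross-checked exactly on a small instance. The sketch below builds the mixture covariance by averaging over the $p - k + 1$ supports, assuming each $z^*(S)$ has entries $1/\sqrt{k}$ on $S$ and zero elsewhere, and compares its determinant with the product of the three eigenvalues raised to their multiplicities. All function names and the test sizes are illustrative assumptions.

```python
from fractions import Fraction


def mixture_covariance(p, k, beta):
    """Average of I_p + beta z(S) z(S)^T over supports S = {1..k-1} plus one more index."""
    base = list(range(k - 1))  # R in 0-based indexing
    supports = [base + [j] for j in range(k - 1, p)]
    sigma = [[Fraction(int(i == j)) for j in range(p)] for i in range(p)]
    weight = Fraction(beta) / (k * len(supports))
    for S in supports:
        for i in S:
            for j in S:
                sigma[i][j] += weight
    return sigma


def determinant(M):
    """Exact determinant by Gaussian elimination over the rationals."""
    M = [row[:] for row in M]
    size = len(M)
    det = Fraction(1)
    for c in range(size):
        pivot = next((r for r in range(c, size) if M[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det
        det *= M[c][c]
        for r in range(c + 1, size):
            factor = M[r][c] / M[c][c]
            for j in range(c, size):
                M[r][j] -= factor * M[c][j]
    return det


def determinant_from_eigenvalues(p, k, beta):
    """Product of the claimed eigenvalues with multiplicities 1, p - k and k - 1."""
    beta = Fraction(beta)
    s = p - k + 1
    top = 1 + beta * (k - 1) / k + beta / (k * s)
    mid = 1 + beta / (k * s)
    return top * mid ** (p - k)


if __name__ == "__main__":
    print(determinant(mixture_covariance(6, 3, 2)))
    print(determinant_from_eigenvalues(6, 3, 2))
```

Taking logarithms of the matching determinants gives the expression stated in Lemma 14.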

APPENDIX G: PROOF OF THEOREM 2(A) 

Since in part (a) of the theorem we are using the weaker scaling $n > \theta_{\mathrm{wr}}k^2\log(p - k)$, we have more freedom in choosing the sign matrix $U$. We choose the upper-left block $U_{SS}$ as in part (b) so that Lemma 6 applies. Also let $\hat{z} := (\hat{z}_S, 0_{S^c})$ as in (29), where $\hat{z}_S$ is the (unique) maximal eigenvector of the $k \times k$ block $\Phi_{SS}$; it has the correct sign by Lemma 6. We set the off-diagonal and lower-right blocks of the sign matrix to

(85) $U_{S^cS} = \frac{1}{\rho_n}\Delta_{S^cS}, \qquad U_{S^cS^c} = \frac{1}{\rho_n}\Delta_{S^cS^c}$,

so that $\Phi_{S^cS} = 0$ and $\Phi_{S^cS^c} = 0$. With these blocks of $\Phi$ being zero, $\hat{z}$ is the maximal eigenvector of $\Phi$, hence an optimal solution of (13), if and only if $\hat{z}_S$ is the maximal eigenvector of $\Phi_{SS}$; the latter is true by definition. Note that this argument is based on the remark following Lemma 5. It only remains to show that the choices of (85) lead to valid sign matrices.

Recalling that the vector $\infty$-norm of a matrix $A$ is $\|A\|_\infty := \max_{i,j}|A_{ij}|$ (see Appendix A), we need to show $\|U_{S^cS}\|_\infty \le 1$ and $\|U_{S^cS^c}\|_\infty \le 1$. Using the notation of Section 4.4 and the mixed-norm inequality (59), we have
_ VP\\Z II || * || 

— 1 1 n S c I |oo \\Zg | |oo 

Pn 

<? "^lllr 1 / 2 ill \\h ll ll * ll 

_ Ill-l p_fc|||oo,oo ||"-S C ||oo H^slloo 
Pn 



where the last line follows under the scaling assumed and assumption (6a) on $|||\Gamma_{p-k}^{1/2}|||_{\infty,\infty}$. For the lower-right block, we use the mixed-norm inequality (59) twice together with symmetry to obtain

$\|U_{S^cS^c}\|_\infty = \frac{1}{\rho_n}\|\Delta_{S^cS^c}\|_\infty = \frac{1}{\rho_n}\|\Gamma_{p-k}^{1/2}W_{S^cS^c}\Gamma_{p-k}^{1/2}\|_\infty$

$\le \frac{1}{\rho_n}|||\Gamma_{p-k}^{1/2}|||_{\infty,\infty}^2\,\|W_{S^cS^c}\|_\infty$,

which can be made less than one by choosing $\theta_{\mathrm{wr}}$ large enough. The bound on $\|W_{S^cS^c}\|_\infty$ used in the last line can be obtained using arguments similar to those of Lemma 9. The proof is complete.